Identification of large intergenic non-coding RNAs using random forest


Authors: XU Wei-na; ZHANG Guang-le; LI Shi-hong[1]; CHEN Yuan-yuan[1]; LI Qiang[1]; YANG Tao[1]; XU Ming-min; QIAO Ning; ZHANG Liang-yun[1] (College of Science, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China)

Affiliation: [1] College of Science, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China

Source: Journal of Shandong University (Natural Science), 2019, No. 3, pp. 85-92, 101 (9 pages in total)

Funding: National Natural Science Foundation of China (11571173; 11401311; 11601231)

Abstract: To better understand and explore the regulatory mechanisms of long intergenic non-coding RNAs (lincRNAs), an efficient lincRNA identification model is established, which can serve as a data source for follow-up studies. Using features such as minimum free energy (MFE) and signal-noise ratio (SNR), and removing redundant features according to their feature contribution, a random forest (RF) classification model is built to identify lincRNAs. Tested on the same experimental dataset, the model reaches a sensitivity, specificity and accuracy of 94.1%, 93.2% and 93.7%, respectively, each higher than the corresponding metric of the existing PhyloCSF, LncRNA-ID and CPC methods. The model shows good robustness during identification and classifies lincRNAs accurately.
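The three metrics quoted in the abstract follow the usual confusion-matrix definitions. The sketch below is illustrative only and uses no data from the paper: it assumes lincRNAs are the positive class (label 1), and the function names `confusion_counts` and `evaluate` and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # positive sample correctly identified
            else:
                fn += 1  # positive sample missed
        else:
            if pred == positive:
                fp += 1  # negative sample wrongly called positive
            else:
                tn += 1  # negative sample correctly rejected
    return tp, fn, tn, fp


def evaluate(y_true, y_pred, positive=1):
    """Return sensitivity, specificity and accuracy as fractions in [0, 1]."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred, positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "sensitivity": tp / (tp + fn),            # share of positives recovered
        "specificity": tn / (tn + fp),            # share of negatives recovered
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Toy labels (assumed, not from the paper): 1 = lincRNA, 0 = other transcript.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = evaluate(y_true, y_pred)
```

With these toy labels there are 3 true positives, 1 false negative, 5 true negatives and 1 false positive, so the sensitivity is 0.75 and the accuracy is 0.8.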

Keywords: long intergenic non-coding RNA; random forest algorithm; minimum free energy; signal-noise ratio

Classification number: Q61 [Biology: Biophysics]

 
