Source: Journal of Yangzhou University: Natural Science Edition (《扬州大学学报(自然科学版)》), 2016, No. 4, pp. 50-53 (4 pages)
Funding: National Natural Science Foundation of China (61301220); Jiangsu Province "Six Talent Peaks" High-Level Talent Project, seventh batch (2010-DZXX-149)
Abstract: To improve the accuracy of long non-coding RNA (lncRNA) prediction, a method based on the random forest algorithm is proposed. The training dataset is derived from internationally used gene annotation and genome sequence data. Features are selected first, and the random forest algorithm is then applied to the dataset containing the feature information for model training. The selected features include the ratios of 14 trinucleotide sequences (ACG, CCG, CGA, CGC, CGG, CGT, CTA, GCG, GGG, GTA, TAA, TAC, TAG, TCG) to the transcript length, the standard deviation of stop codon counts across the three reading frames, GC content, protein-coding potential (CDS, CDS length, and ratio of CDS to transcript), transcript length, exon count, average exon length, and conservation score (average PhastCons score of the transcript). The over-fitting problem that arose when implementing the other algorithms is also addressed. Results of 10-fold cross-validation show that the random forest method outperforms other methods, including K-nearest neighbors (K-NN), naive Bayes, and Bayesian network, in terms of true positive rate, precision, recall, F score, and AUC (area under the curve).
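The sequence-derived features named in the abstract (trinucleotide ratios, stop codon standard deviation across reading frames, GC content, transcript length) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, and several choices are assumptions the abstract does not settle, namely that trinucleotide occurrences are counted with overlap, that the stop codons are TAA, TAG, and TGA, and that the standard deviation is the population form. Features that need external data (coding potential, exon structure, PhastCons scores) are omitted.

```python
from statistics import pstdev

# The 14 trinucleotides listed in the abstract.
TRINUCLEOTIDES = ["ACG", "CCG", "CGA", "CGC", "CGG", "CGT", "CTA",
                  "GCG", "GGG", "GTA", "TAA", "TAC", "TAG", "TCG"]
# Assumption: the standard DNA stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def count_overlapping(seq, motif):
    """Count occurrences of motif in seq, allowing overlaps (an assumption)."""
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif)


def stop_codon_counts(seq):
    """Return the number of stop codons in each of the three forward reading frames."""
    counts = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        counts.append(sum(1 for codon in codons if codon in STOP_CODONS))
    return counts


def sequence_features(seq):
    """Compute the sequence-only features for one transcript (hypothetical helper)."""
    seq = seq.upper()
    n = len(seq)
    if n < 3:
        raise ValueError("transcript must contain at least one codon")
    # Ratio of each trinucleotide count to the transcript length.
    features = {f"ratio_{t}": count_overlapping(seq, t) / n for t in TRINUCLEOTIDES}
    features["gc_content"] = (seq.count("G") + seq.count("C")) / n
    # Population standard deviation of the three per-frame stop codon counts.
    features["stop_codon_std"] = pstdev(stop_codon_counts(seq))
    features["transcript_length"] = n
    return features
```

A feature dictionary of this kind, computed per transcript and joined with the remaining features, would form one row of the training table passed to a random forest classifier.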
Classification: TP391.4 [Automation and Computer Technology: Computer Application Technology]