Research on DNA Sequence Classification Based on Machine Learning (基于机器学习的DNA序列分类研究)


Authors: BAO Zhikang (保志康); CHEN Jixuan (陈继璇); LIU Yinxiao (刘印晓); ZHANG Maoyuan (张茂源); ZHANG Hongbo (章洪博); LIU Zhen'an (刘振安); WEI Xiaojuan (魏晓娟) [1]

Affiliation: [1] College of Electrical Engineering, Northwest Minzu University, Lanzhou 730000, Gansu, China

Source: Biological Chemical Engineering (《生物化工》), 2024, No. 3, pp. 20-27 (8 pages)

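Funding: National Natural Science Foundation of China (12205241); Natural Science Foundation of Gansu Province (20JR10RA115); Innovation Fund for Higher Education Institutions of Gansu Province (2022B-074); Fundamental Research Funds for the Central Universities (31920220049, 31920230138).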

Abstract: DNA carries all the genetic information of an organism and determines the structure and function of its genes. Predicting the category to which a DNA sequence belongs makes it possible to judge whether an unknown sample is a new species, an alien species, or a well-known species. With the development of biotechnology, how to extract complete information from acquired DNA sequences, predict their sequence composition, find the rules governing that composition, and accurately reflect species characteristics has become an important problem in bioinformatics. In this study, two classes of DNA sequence files, with accession numbers CP021707 and CP085300, were downloaded from the NCBI website. Using feature extraction methods based on base frequency and base count, single-base, double-base, and triple-base features were extracted to construct 84-dimensional, 168-dimensional, and 35-dimensional feature vectors. Classification was then performed with K-nearest neighbor (KNN), support vector machine (SVM), and fused KNN and SVM (KNN-SVM) models. The experimental results show that, with the 168-dimensional feature vector, the KNN-SVM model achieves higher classification accuracy than either KNN or SVM alone, which is of positive significance for judging the relevant characteristics of an unknown class.
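The feature extraction and nearest-neighbor steps named in the abstract can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not confirm: features are relative frequencies of overlapping single-, double-, and triple-base windows, concatenated into 4 + 16 + 64 = 84 values (this happens to match the 84-dimensional figure, but the paper's actual construction of its 84-, 168-, and 35-dimensional vectors is not described here). The KNN step uses plain Euclidean distance and majority voting, and the KNN-SVM fusion is not reproduced. The function names `kmer_frequencies`, `kmer_features`, and `knn_predict`, and the toy sequences and labels, are hypothetical.

```python
from collections import Counter
from itertools import product
import math

BASES = "ACGT"


def kmer_frequencies(seq, k):
    """Relative frequency of each of the 4**k possible k-mers, in a fixed order.

    Windows overlap; windows containing characters other than A/C/G/T are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def kmer_features(seq):
    """Concatenate 1-, 2-, and 3-mer frequencies (assumed layout: 4 + 16 + 64 = 84)."""
    vec = []
    for k in (1, 2, 3):
        vec.extend(kmer_frequencies(seq, k))
    return vec


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training vectors closest to x (Euclidean distance)."""
    nearest = sorted((math.dist(xi, x), yi) for xi, yi in zip(train_X, train_y))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]


# Toy data, invented purely for illustration (not the CP021707 / CP085300 sequences).
toy = [
    ("ATATATATAT", "AT-rich"),
    ("AATTAATTAA", "AT-rich"),
    ("GCGCGCGCGC", "GC-rich"),
    ("GGCCGGCCGG", "GC-rich"),
]
train_X = [kmer_features(s) for s, _ in toy]
train_y = [label for _, label in toy]
predicted = knn_predict(train_X, train_y, kmer_features("ATTATATAAT"), k=3)
```

In practice the genome would first be cut into fragments, each fragment converted to a feature vector, and an established library classifier used instead of this hand-written KNN; those choices are left open because the abstract does not specify them.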

Keywords: support vector machine; DNA sequence; feature extraction; K-nearest neighbor; classification accuracy

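Classification codes: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]; Q751 [Automation and Computer Technology: Control Science and Engineering]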

 
