Multi-Classification Prediction of Coronavirus Hosts Based on Spike Protein Sequences and Machine Learning Methods


Authors: ZHAO Jian (赵健) [1]; WANG Zhi-bo (王治博); XIE Di (谢翟); ZHANG Li (张力) [1]; LIU Hong-sheng (刘宏生)

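Affiliations: [1] School of Life Sciences, Liaoning University, Shenyang 110036, Liaoning, China; [2] School of Pharmaceutical Sciences, Liaoning University, Shenyang 110036, Liaoning, China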

Source: Journal of Liaoning University: Natural Sciences Edition (《辽宁大学学报(自然科学版)》), 2023, No. 4, pp. 312-317 (6 pages)

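Funding: Shenyang Young and Middle-aged Science and Technology Innovation Talent Support Program (RC210216); National Natural Science Foundation of China, Young Scientists Fund (82003655); General Project of the Education Department of Liaoning Province (LJKZ0088); Liaoning Province "Xingliao Talents Program" (XLYC2002045); Liaoning Province Key Research and Development Program (2019JH2/10300041).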

Abstract: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) caused the global COVID-19 pandemic in late 2019, and coronaviruses have crossed species barriers into many mammals, including humans. Rapid and accurate prediction of coronavirus host class is therefore important for controlling and preventing future epidemics. In this study, spike protein sequences were collected from the NCBI (National Center for Biotechnology Information) virus database. After duplicates were removed with CD-HIT, 3216 sequences remained; these were assigned to six host classes, sorted by collection time, and split into a training set and a test set at a ratio of 8:2. The spike protein sequences were encoded with the distribution descriptor (CTDD) and with the natural language model Seq2Vec, and several machine learning methods were used to train and evaluate classification models. For predicting human hosts, Seq2Vec-GCNN was the best model, with an accuracy of 99.37%. For the other host classes, CTDD-RF performed best, with accuracies of 95.82% for swine, 95.96% for avian hosts, 98.33% for camels, 92.06% for bats, and 94.01% for other mammals. The results show that machine learning models built on spike protein sequences are a practical and effective way to predict coronavirus host class.
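The abstract names two steps that are easy to illustrate: the time-ordered 8:2 split and the distribution-descriptor encoding. The following is a minimal sketch, not the authors' implementation. It assumes each record is a dict with an ISO-formatted `date` field (a hypothetical layout), and it computes a distribution-style descriptor for a single property only, using an illustrative three-class hydrophobicity grouping; the residue groupings and properties used in the study are not stated in the abstract.

```python
import math

# Assumed three-class hydrophobicity grouping, for illustration only.
# The abstract does not say which residue groupings the study used.
GROUPS = {
    "polar": set("RKEDQN"),
    "neutral": set("GASTPHY"),
    "hydrophobic": set("CLVIMFW"),
}


def distribution_descriptor(seq, groups=GROUPS):
    """For each residue group, return the position (as % of sequence length)
    of the first, 25%, 50%, 75% and last occurrence of that group."""
    n = len(seq)
    features = []
    for members in groups.values():
        # 1-based positions of residues belonging to this group
        positions = [i + 1 for i, aa in enumerate(seq) if aa in members]
        if not positions:
            features.extend([0.0] * 5)  # group absent from the sequence
            continue
        count = len(positions)
        cutoffs = [1, math.floor(0.25 * count), math.floor(0.50 * count),
                   math.floor(0.75 * count), count]
        for c in cutoffs:
            occurrence = max(c, 1)  # clamp so short groups still map to a residue
            features.append(positions[occurrence - 1] / n * 100)
    return features


def chronological_split(records, train_fraction=0.8):
    """Sort records by collection date (oldest first) and cut at train_fraction.
    Assumes each record has an ISO 'YYYY-MM-DD' string under the key 'date'."""
    ordered = sorted(records, key=lambda r: r["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


if __name__ == "__main__":
    toy = [
        {"date": "2020-03-01", "seq": "RKGACL"},
        {"date": "2019-12-30", "seq": "MFVFLV"},
        {"date": "2021-06-15", "seq": "GASTPH"},
        {"date": "2018-01-10", "seq": "CLVIMF"},
        {"date": "2022-02-02", "seq": "RKEDQN"},
    ]
    train, test = chronological_split(toy)
    print(len(train), len(test))
    print(distribution_descriptor(train[0]["seq"]))
```

Splitting by time rather than at random means the test set contains only sequences collected after those in the training set, which is closer to how such a model would be used on newly sampled viruses. The resulting fixed-length feature vectors could then be passed to any standard classifier, such as a random forest.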

Keywords: machine learning; coronavirus; spike protein

Classification code: TP302.1 [Automation and Computer Technology: Computer System Architecture]

 
