Prediction of Protein Subcellular Localization Based on Clustering and Feature Fusion  Cited by: 4


Authors: WANG Yi-hao; DING Hong-wei[1]; LI Bo[1]; BAO Li-yong[1]; ZHANG Ying-jie (School of Information Science and Engineering, Yunnan University, Kunming 650500, China)

Affiliation: [1] School of Information Science and Engineering, Yunnan University, Kunming 650500, China

Source: Computer Science (《计算机科学》), 2021, No. 3, pp. 206-213 (8 pages)

Funding: National Natural Science Foundation of China (61461053, 61461054).

Abstract: Predicting the subcellular localization of proteins is an important basis for studying protein structure and function, and it is also significant for understanding the pathogenesis of some diseases and for drug design and discovery. However, using machine learning to predict protein subcellular location accurately remains a challenging scientific problem. To address this problem, this paper proposes a protein subcellular localization method based on clustering and feature fusion. First, the autocorrelation coefficient method and the entropy density method are introduced into the construction of the protein feature representation model, and an improved PseAAC (pseudo-amino acid composition) method is proposed on the basis of traditional PseAAC. To represent protein sequence information better, the autocorrelation coefficient method, the entropy density method, and the improved PseAAC are fused to construct a new protein sequence representation model. Second, principal component analysis (PCA) is used to reduce the dimensionality of the fused feature vector. Third, the reduced features are fed to the LibD3C ensemble classifier to predict protein subcellular localization, and prediction accuracy is evaluated by leave-one-out cross-validation on the Gram-positive and Gram-negative datasets. Finally, the experimental results are compared with those of other existing algorithms. The proposed method achieves prediction accuracies of 99.24% and 95.33% on the Gram-positive and Gram-negative datasets, respectively, which indicates that the method is sound and effective.
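The feature-fusion and leave-one-out steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact feature definitions, so the sketch assumes a plain amino acid composition in place of the improved PseAAC, an autocorrelation coefficient computed on the Kyte-Doolittle hydropathy scale, and an entropy density defined as each residue's share of the sequence's Shannon entropy. The PCA step is omitted, and a 1-nearest-neighbour rule stands in for the LibD3C ensemble classifier. All function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index; the choice of scale is an assumption,
# the abstract does not say which physicochemical property is used.
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def composition(seq):
    """20-dim amino acid frequencies (stand-in for the improved PseAAC)."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def autocorrelation(seq, max_lag=3):
    """Mean product of hydropathy values at residues `lag` positions apart.

    Assumes len(seq) > max_lag and only the 20 standard residues.
    """
    vals = [HYDROPATHY[a] for a in seq]
    n = len(vals)
    return [
        sum(vals[i] * vals[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


def entropy_density(seq):
    """Each residue's share of the Shannon entropy of the sequence."""
    freqs = composition(seq)
    total = -sum(f * math.log2(f) for f in freqs if f > 0)
    if total == 0:  # single-residue sequence: entropy is zero
        return [0.0] * len(AMINO_ACIDS)
    return [(-f * math.log2(f) / total) if f > 0 else 0.0 for f in freqs]


def fuse(seq, max_lag=3):
    """Feature fusion by simple concatenation of the three descriptors."""
    return composition(seq) + autocorrelation(seq, max_lag) + entropy_density(seq)


def leave_one_out_accuracy(vectors, labels):
    """Leave-one-out evaluation with a 1-nearest-neighbour stand-in classifier."""
    correct = 0
    for i, (vec, label) in enumerate(zip(vectors, labels)):
        others = (j for j in range(len(vectors)) if j != i)
        nearest = min(others, key=lambda j: math.dist(vec, vectors[j]))
        correct += labels[nearest] == label
    return correct / len(vectors)
```

In a fuller pipeline, the fused vectors would be reduced with PCA before classification, as the abstract describes. With `max_lag=3`, the fused vector has 20 + 3 + 20 = 43 dimensions.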

Keywords: feature fusion; clustering; autocorrelation coefficient; pseudo-amino acid composition; principal component analysis

Classification code: TP391 [Automation and Computer Technology - Computer Application Technology]

 
