Cancer Classification Prediction Model Based on Correlation and Similarity  Cited by: 2

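Authors: ZHANG Xue-fu (张学扶); ZENG Pan (曾攀); JIN Min (金敏) (College of Computer Science and Electronic Engineering, Hunan University, Changsha 410006, China)

Affiliation: [1] College of Information Science and Engineering, Hunan University

Source: Computer Science (《计算机科学》), 2019, No. 7, pp. 300-307 (8 pages)

Funding: Supported by the National Natural Science Foundation of China (61773157)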

Abstract: Cancer diagnosis based on empirical histopathology often suffers from a high misdiagnosis rate. Analyzing cancer at the gene level is currently one of the important ways to improve the accuracy of cancer classification prediction. Biological studies have shown that genes associated with the same type of cancer share common functional characteristics. Based on this, the paper proposes an ensemble method for cancer classification prediction that combines correlation and similarity. First, on the one hand, the differential expression of genes is analyzed from a statistical perspective, using mutual information to compute correlation on gene expression profile data; on the other hand, similarity between genes is analyzed from the perspective of biological mechanism, combining topological similarity and semantic similarity to compute functional similarity between genes from the protein interaction network and GO data, respectively. The two are then combined: the feature gene set is selected by simultaneously maximizing the correlation and the similarity of the target set. Next, the Bootstrap method is used to draw diverse samples from the dataset, and, on the basis of the selected feature gene set, multiple machine learning algorithms are trained to obtain several classification prediction models that differ substantially from one another. Finally, the resulting models classify the test samples, and a decision model produces the final classification result. Classification prediction experiments were carried out on four different cancer datasets from GEO, and the proposed method was compared comprehensively with recent methods. The proposed method improves classification prediction accuracy by about 5% on every dataset, and by up to 10% relative to the IG/SGA method. The experimental results show that combining correlation and similarity effectively improves cancer classification prediction accuracy, that the selected feature genes help reveal biological significance, and that the complementary strengths of multiple algorithms address the limited applicability of any single classification algorithm.
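The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: the additive score with weight `alpha`, the precomputed `sim` dictionary standing in for the topological and semantic similarity computed from the protein interaction network and GO data, the two toy base learners, the redrawing of resamples that miss a class, and a plain majority vote in place of the paper's decision model. The gene names and expression values are invented.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def select_genes(expr, labels, sim, k, alpha=0.5):
    """Greedily pick k genes by relevance + alpha * mean similarity to picks.

    alpha is an assumed trade-off weight, not a value from the paper.
    """
    relevance = {g: mutual_information(v, labels) for g, v in expr.items()}
    chosen = []

    def score(g):
        if not chosen:
            return relevance[g]
        total = sum(sim.get((g, c), sim.get((c, g), 0.0)) for c in chosen)
        return relevance[g] + alpha * total / len(chosen)

    while len(chosen) < min(k, len(expr)):
        chosen.append(max((g for g in expr if g not in chosen), key=score))
    return chosen

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest_centroid(X, y):
    """Toy base learner: predict the class whose mean vector is closest."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    centroids = {label: [sum(col) / len(rows) for col in zip(*rows)]
                 for label, rows in groups.items()}
    return lambda row: min(centroids, key=lambda c: sq_dist(row, centroids[c]))

def one_nn(X, y):
    """Toy base learner: predict the label of the closest training sample."""
    return lambda row: y[min(range(len(X)), key=lambda j: sq_dist(row, X[j]))]

def train_ensemble(X, y, learners, rounds, seed=0):
    """Fit every learner on each of `rounds` bootstrap resamples."""
    rng = random.Random(seed)
    models = []
    while len(models) < rounds * len(learners):
        idx = [rng.randrange(len(X)) for _ in X]  # sample with replacement
        yb = [y[i] for i in idx]
        if len(set(yb)) < len(set(y)):
            continue  # assumption: redraw resamples that miss a class
        Xb = [X[i] for i in idx]
        models.extend(fit(Xb, yb) for fit in learners)
    return models

def predict_vote(models, row):
    """Majority vote, used here as a stand-in for the decision model."""
    return Counter(m(row) for m in models).most_common(1)[0][0]

# Invented data: three hypothetical genes, six samples, two classes.
labels = [0, 0, 0, 1, 1, 1]
raw = {
    "g1": [0.1, 0.0, 0.2, 0.9, 1.0, 0.8],
    "g2": [0.2, 0.1, 0.6, 0.9, 0.8, 0.3],
    "g3": [0.1, 0.7, 0.2, 0.9, 0.3, 0.8],
}
expr = {g: [int(v > 0.5) for v in vals] for g, vals in raw.items()}
sim = {("g1", "g2"): 0.9, ("g1", "g3"): 0.1, ("g2", "g3"): 0.2}

genes = select_genes(expr, labels, sim, k=2)
X = [[raw[g][i] for g in genes] for i in range(len(labels))]
models = train_ensemble(X, labels, [nearest_centroid, one_nn], rounds=5)
print(genes, predict_vote(models, [0.95, 0.9]))
```

In this toy data, g2 and g3 carry the same mutual information with the labels, so the similarity term decides the second pick: g2 is chosen because it is more similar to the already selected g1. A real pipeline would replace the toy learners with full classifiers and the `sim` dictionary with similarity scores derived from actual network and ontology data.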

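Keywords: cancer classification; correlation; semantic similarity; topological similarity; diversity sampling; multi-algorithm multi-model

Classification code: TP391.9 [Automation and Computer Technology: Computer Application Technology]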

 
