Dimensionality Reduction and Screening of Triple Negative Breast Cancer Related Genes Using Random Forest and Support Vector Machine  Cited by: 8


Authors: Qin Pu; Guo Zhiwang; Guo Weiheng; Zhang Rui; Liu Xuehui; Wang Liqin (Department of Epidemiology and Statistics, School of Public Health, Hebei Medical University, Shijiazhuang 050017)

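Affiliations: [1] Department of Epidemiology and Health Statistics, School of Public Health, Hebei Medical University, 050017 [2] Hebei Key Laboratory of Environment and Human Health [3] Department of Occupational and Environmental Health, School of Public Health, Hebei Medical University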

Source: Chinese Journal of Health Statistics, 2020, Issue 3, pp. 389-394 (6 pages)

Abstract: Objective: To process breast cancer gene expression data with random forest and support vector machine algorithms and to screen genes differentially expressed between triple-negative and non-triple-negative breast cancer, providing more reference targets for clinical application. Methods: Using TCGA breast cancer gene data, dimensionality was reduced by t-test and by random forest. Variable importance was then ranked by support vector machine, support vector machine-recursive feature elimination (SVM-RFE), and random forest. Random forest and support vector machine were each combined with forward variable selection to build prediction models and complete the final variable screening, and model performance was evaluated by Holdout validation. Results: 18702 genes remained after dimensionality reduction by the FDR of the t-test, and 6326 genes remained after dimensionality reduction by random forest. Prediction models were built on the reduced data as ranked by the three methods, and evaluation indices such as the Youden index were obtained for each model. The evaluation indices after t-test reduction were better than those after random forest reduction. After dimensionality reduction, ranking by random forest importance gave the best model evaluation indices, and the recall of the support vector machine was much higher than that of random forest. A literature search on the top-ranked genes found that most are related to the metastasis or prognosis of triple-negative breast cancer. Conclusion: For variable selection in high-dimensional gene expression data, the best overall result was obtained by using the FDR of the t-test for dimensionality reduction, random forest for ranking and screening variables, and support vector machine for prediction. Most of the top-ranked genes were found to be related to triple-negative breast cancer, but some top-ranked genes have no published research linking them to it, and studying their association with triple-negative breast cancer is recommended.
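Two quantities named in the abstract, FDR-based filtering of t-test p-values and the Youden index, can be illustrated with a minimal sketch. The record does not say which FDR procedure or significance level the authors used, so the Benjamini-Hochberg step-up rule and `alpha=0.05` below are assumptions, and the function names `benjamini_hochberg` and `youden_index` are hypothetical helpers, not the authors' code.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return sorted indices of tests kept as significant.

    Assumption: Benjamini-Hochberg step-up rule; the source only says
    "FDR of the t-test" without naming a procedure or threshold.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order of p.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0  # largest rank k with p_(k) <= (k / m) * alpha
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Step-up: every test ranked at or below the cutoff is kept.
    return sorted(order[:cutoff])


def youden_index(y_true, y_pred):
    """Youden index J = sensitivity + specificity - 1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


# Made-up p-values for four hypothetical genes, not data from the study.
kept = benjamini_hochberg([0.01, 0.04, 0.03, 0.035], alpha=0.05)
# Made-up labels: 1 = triple-negative, 0 = non-triple-negative.
j = youden_index([1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 1])
```

In a workflow like the one described, the indices returned by the filter would select the genes passed on to ranking and forward selection, and the Youden index would be computed on the held-out samples for each candidate model.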

Keywords: high-dimensional transcriptome data; random forest; support vector machine; forward variable selection

Classification: R737.9 [Medicine and Health: Oncology]; TP181 [Medicine and Health: Clinical Medicine]

 
