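Authors: Qin Pu (秦璞), Guo Zhiwang (郭志旺), Guo Weiheng (郭维恒), Zhang Rui (张蕊), Liu Xuehui (刘学慧), Wang Liqin (王立芹) (Department of Epidemiology and Statistics, School of Public Health, Hebei Medical University, Shijiazhuang 050017)
Affiliations: [1] Department of Epidemiology and Health Statistics, School of Public Health, Hebei Medical University, 050017; [2] Hebei Key Laboratory of Environment and Human Health; [3] Department of Occupational and Environmental Health, School of Public Health, Hebei Medical University
Source: Chinese Journal of Health Statistics (《中国卫生统计》), 2020, No. 3, pp. 389-394 (6 pages)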
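Abstract:
Objective: To apply random forest and support vector machine algorithms to breast cancer gene data and screen for genes differentially expressed between triple-negative and non-triple-negative breast cancer, providing more reference targets for clinical application.
Methods: Using TCGA breast cancer gene data, dimensionality was reduced by t-test and by random forest. Variables were then ranked by importance using support vector machine, support vector machine recursive feature elimination, and random forest. Random forest and support vector machine were each combined with forward variable selection to build prediction models and complete the final variable screening, and model performance was evaluated by Holdout validation.
Results: After dimensionality reduction using the FDR of the t-test, 18,702 genes remained; after dimensionality reduction by random forest, 6,326 genes remained. Prediction models were built on the reduced data as ranked by each of the three methods, and evaluation metrics such as the Youden index were obtained for each model. A literature search on the top-ranked genes found that most are related to the metastasis or prognosis of triple-negative breast cancer.
Conclusion: For variable selection on high-dimensional gene expression data, the best results were obtained by using the FDR of the t-test for dimensionality reduction, random forest for ranking and selecting variables, and support vector machine for prediction. Most of the top-ranked genes were found to be related to triple-negative breast cancer, but some have no published studies linking them to it; investigating the association of these genes with triple-negative breast cancer is recommended.

Objective: Random forest and support vector machine algorithms were used to process the gene expression data of breast cancer. The differentially expressed genes of triple-negative and non-triple-negative breast cancer were screened, providing more reference targets for clinical diagnosis.
Methods: Using TCGA breast cancer gene data, dimensionality reduction was carried out through t-test and random forest. The importance of variables was then ranked by support vector machine, support vector machine-recursive feature elimination, and random forest. Random forest and support vector machine were combined with the forward variable selection method to predict and complete the final variable selection, and the model effect was evaluated by cross-validation.
Results: 18,702 genes remained after dimensionality reduction by t-test and 6,326 genes remained after dimensionality reduction by random forest. The evaluation indices after dimensionality reduction by t-test were better than those after random forest. After dimensionality reduction, the model evaluation indices were best when the random forest importance ranking was used. The recall of support vector machine was much higher than that of random forest, and the prediction effect of the model was good.
Conclusion: For variable selection on high-dimensional gene expression data, using the FDR of the t-test to reduce dimensionality, random forest to rank and select variables, and support vector machine to predict gives the best overall effect.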
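The abstract names two quantities without defining them: FDR-based filtering of per-gene t-test p-values, and the Youden index used to score the classifiers. The record gives no code or formulas for either, so the sketch below is only an illustration. It assumes the Benjamini-Hochberg step-up procedure for FDR control (the abstract does not say which FDR method was used), assumes label 1 means triple-negative, and uses hypothetical function names and made-up toy inputs.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[int]:
    """Return the indices kept at FDR level alpha (Benjamini-Hochberg step-up).

    Assumption: p_values holds one p-value per gene, e.g. from a two-sample
    t-test comparing the two tumour groups.
    """
    m = len(p_values)
    # Sort gene indices by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank k with p_(k) <= alpha * k / m.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    # Every gene ranked at or below the cutoff is kept.
    return sorted(order[:cutoff_rank])


def youden_index(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Youden index J = sensitivity + specificity - 1 for binary labels 0/1."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1:
            if pred == 1:
                tp += 1
            else:
                fn += 1
        else:
            if pred == 0:
                tn += 1
            else:
                fp += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


# Toy inputs, invented purely for illustration (not data from the study).
kept = benjamini_hochberg([0.02, 0.03, 0.035, 0.9], alpha=0.05)
j = youden_index([1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 1])
```

In the toy call, the first two p-values fail their own rank thresholds but are still kept because the third passes its threshold; that is the step-up behaviour. A Youden index of 1 corresponds to perfect separation on the held-out set and 0 to a classifier no better than chance.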