Risk prediction study of prostate tumors based on the Borderline-SMOTE algorithm and Stacking ensemble learning

Cited by: 2

Authors: XIONG Siwei (熊思伟), LIU Yulin (刘玉琳) [1,2]

Affiliations: [1] School of Microelectronics and Data Science, Anhui University of Technology, Ma'anshan, Anhui 243032, China; [2] Anhui Provincial Joint Key Laboratory of Disciplines for Industrial Big Data Analysis and Intelligent Decision, Ma'anshan, Anhui 243032, China

Source: Journal of Modern Oncology (《现代肿瘤医学》), 2023, Issue 16, pp. 3075-3081 (7 pages)

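Funding: Anhui Provincial Teaching Research Project (No. 2020jyxm0212); Anhui Provincial Quality Engineering Project (No. 2021xxkc017); College Student Innovation and Entrepreneurship Training Program (No. 202110360319).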

Abstract: Objective: To build a high-accuracy combined model using data mining methods to predict risk in patients with prostate tumors, providing a reference for the prevention and diagnosis of prostate cancer (PCa). Methods: A total of 682 patients who underwent prostate biopsy at the Clinical Medical Science Data Center (301 Hospital) were selected, and mutual information was used as the criterion for screening feature attributes related to PCa. Single models were built with the XGBoost, Logistic regression, AdaBoost, K-nearest neighbor, and random forest algorithms, and five-fold cross-validation was used to select the three models with the best predictive ability. Oversampling was then applied to build single models based on Borderline-SMOTE and Stacking combination models based on Borderline-SMOTE, and the effect of different combinations was explored. Finally, 37 clinical cases from 301 Hospital and Wuhu Yijishan Hospital were used as an external validation set to test the model. Results: Mutual information screening identified 19 key feature attributes. Among the single models, the random forest, XGBoost, and AdaBoost models performed best. Applying Borderline-SMOTE to the single models balanced the label distribution and substantially increased the AUC. Of the three Borderline-SMOTE-based Stacking combination models, the one with XGBoost and random forest as primary classifiers and AdaBoost as the secondary classifier had the best predictive ability, with an accuracy of 0.9454, a recall of 0.9375, a precision of 0.9573, an F1 score of 0.9470, and an AUC of 0.9823; it also predicted well on the clinical validation set. Conclusion: Borderline-SMOTE oversampling is very effective for handling imbalanced data sets. Compared with single-model prediction, the PCa risk prediction method based on Stacking ensemble learning with multi-model fusion achieves higher prediction accuracy and good generalization performance, and is more helpful for the clinical diagnosis of PCa.

Keywords: prostate tumor; mutual information; Borderline-SMOTE; Stacking ensemble learning

CLC number: R737.25 [Medicine and Health: Oncology]
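
The abstract names Borderline-SMOTE as the oversampling step but does not say which variant or parameters were used. The following is a minimal standard-library sketch of the general idea, in the style of the borderline-SMOTE1 variant. A minority sample counts as "in danger" when at least half, but not all, of its k nearest neighbours belong to the majority class, and new samples are interpolated between danger samples and their nearest minority neighbours. The function names, the toy two-feature data, `k=3`, and the random seed are all illustrative assumptions, not the authors' code or data.

```python
import math
import random


def nearest(point, pool, k):
    """Return the k (features, label) pairs in `pool` closest to `point`."""
    return sorted(pool, key=lambda pair: math.dist(point, pair[0]))[:k]


def find_danger(X, y, minority, k):
    """Minority samples whose k nearest neighbours are mostly, but not all, majority."""
    danger = []
    for i, (x, label) in enumerate(zip(X, y)):
        if label != minority:
            continue
        others = [(X[j], y[j]) for j in range(len(X)) if j != i]
        nn = nearest(x, others, k)
        n_major = sum(1 for _, lab in nn if lab != minority)
        # All-majority neighbourhoods are treated as noise, mostly-minority as safe.
        if len(nn) / 2 <= n_major < len(nn):
            danger.append(x)
    return danger


def borderline_smote(X, y, minority=1, k=5, seed=0):
    """Return synthetic minority samples that equalise the two class counts.

    Assumes a binary problem; `seed` only makes the sketch reproducible.
    """
    rng = random.Random(seed)
    minority_pts = [(x, lab) for x, lab in zip(X, y) if lab == minority]
    n_new = max(0, len(X) - 2 * len(minority_pts))  # majority count - minority count
    danger = find_danger(X, y, minority, k)
    if not danger or len(minority_pts) < 2:
        return []
    synthetic = []
    while len(synthetic) < n_new:
        base = rng.choice(danger)
        pool = [p for p in minority_pts if p[0] is not base]
        neighbour, _ = rng.choice(nearest(base, pool, k))
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: 7 majority samples (label 0), 5 minority samples (label 1).
X = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0],
     [3.0, 0.5],
     [3.0, 0.0], [3.0, 1.0], [5.0, 0.0], [5.0, 1.0], [6.0, 0.5]]
y = [0] * 7 + [1] * 5

synthetic = borderline_smote(X, y, minority=1, k=3)
print(len(synthetic), "synthetic minority samples:", synthetic)
```

In this toy set only the two minority samples next to the majority cluster are flagged as danger points, so the new samples concentrate near the class boundary instead of being spread over the whole minority class. In practice a maintained implementation such as `BorderlineSMOTE` from the imbalanced-learn package would normally be preferred, and oversampling should be applied to training folds only so that validation data stays untouched.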
