Breast cancer survival prediction based on ensemble learning  (Cited by: 1)


Authors: ZHANG Jijie; QIN Qinghong [2]; LIU Xueping [3]; WANG Kangquan; WEI Wei

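Affiliations: [1] College of Science, Guangxi University of Science and Technology, Liuzhou 545006, Guangxi, China; [2] Affiliated Cancer Hospital, Guangxi Medical University, Nanning 530021, Guangxi, China; [3] Medical School, Guangxi University of Science and Technology, Liuzhou 545005, Guangxi, China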

Source: Journal of Guangxi University of Science and Technology, 2022, No. 1, pp. 101-109 (9 pages)

Funding: Supported by the Guangxi Natural Science Foundation (2019GXNSFAA245067).

Abstract: This study predicts the 5-year survival status of breast cancer patients and analyzes the factors that influence it. First, breast cancer records from 2004 to 2010 were selected from the SEER database, and the selected features were preprocessed. Second, at the data level, SMOTE oversampling was applied to address class imbalance; at the algorithm level, LightGBM, CatBoost, and GBDT were compared on predicting 5-year survival status. Finally, the influencing factors were ranked by importance and interpreted using SHAP values. The prediction model constructed in this paper performs better than a single model, with accuracy, AUC, recall, precision, and F1-score of 0.9060, 0.8443, 0.9837, 0.9160, and 0.9487, respectively. Five-year survival status is strongly associated with tumor size, the total number of lymph nodes examined, the number of positive lymph nodes, estrogen receptor (ER) status, progesterone receptor (PR) status, and age. The important features selected by the model are consistent with current clinical findings, so the model can provide technical support for clinical prognosis prediction.
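The SMOTE oversampling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `smote`, its parameters, and the toy two-feature records are assumptions made for illustration, and the sketch handles numeric features only (real SEER records also contain categorical fields that would need separate treatment). It shows the core idea of SMOTE, which is to create each synthetic minority sample by interpolating between a minority sample and one of its k nearest minority-class neighbours.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    Hypothetical helper for illustration; numeric features only.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Place the new point at a random position on the segment
        # between the base point and the chosen neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy, made-up records with two numeric features each (not SEER data).
majority = [[1.0, 2.0], [1.2, 1.9], [0.8, 2.1], [1.1, 2.2], [0.9, 1.8], [1.3, 2.0]]
minority = [[3.0, 4.0], [3.5, 4.5], [2.8, 4.2]]

# Generate enough synthetic points to match the majority class size.
new_points = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

In practice, oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the evaluation data; whether the paper did so is not stated in the abstract.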

Keywords: SEER database; breast cancer; ensemble learning; prognosis prediction

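Classification codes: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]; R737.9 [Automation and Computer Technology: Control Science and Engineering]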

 
