Establishment of a Traditional Chinese Medicine Syndrome Diagnostic Model Based on Stacking Ensemble Learning: Taking Lung Cancer as an Example


Authors: GUO Xiaochuan; FENG Zhenzhen [1,2]; LIU Wenrui; LI Jiansheng (Co-construction Collaborative Innovation Center for Chinese Medicine and Respiratory Diseases by Henan and Education Ministry of P.R. China, Henan University of Chinese Medicine, Zhengzhou 450046; The First Affiliated Hospital of Henan University of Chinese Medicine; The First Clinical Medical College, Henan University of Chinese Medicine)

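Affiliations: [1] Henan University of Chinese Medicine / Co-construction Collaborative Innovation Center for Chinese Medicine and Respiratory Diseases by Henan and Education Ministry of P.R. China, Zhengzhou 450046, Henan Province; [2] The First Affiliated Hospital of Henan University of Chinese Medicine; [3] The First Clinical Medical College, Henan University of Chinese Medicine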

Source: Journal of Traditional Chinese Medicine (《中医杂志》), 2024, Issue 17, pp. 1775-1783 (9 pages)

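Funding: National Administration of Traditional Chinese Medicine, TCM Inheritance and Innovation "Hundred-Thousand-Ten Thousand" Talent Project, Qihuang Project Chief Scientist (Guo Zhong Yi Yao Ren Jiao Han [2020] No. 219); National Natural Science Foundation of China (82205313); Henan Province Special Research Project on Traditional Chinese Medicine (2022JDZX102).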

Abstract: Objective: To explore how Stacking ensemble learning can be used to optimize the performance of traditional Chinese medicine (TCM) syndrome diagnostic models. Methods: Taking the construction of a TCM syndrome diagnostic model for lung cancer as an example, clinical symptom and sign information from 2598 case records of lung cancer patients in 9 hospitals was used as the independent variables (i.e., feature variables) and TCM syndrome information as the dependent variable. Using Python 3.7, the clinical data were divided into a training set and a test set at an 8:2 ratio by the random number table method. Stable features of TCM syndromes of lung cancer were screened using the chi-square test, Spearman's correlation test, and least absolute shrinkage and selection operator (LASSO) logistic regression analysis. Nine machine learning algorithms were trained: support vector machine (SVM), k-nearest neighbors (KNN), random forest (RF), extremely randomized trees (ExtraTrees), extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), adaptive boosting (AdaBoost), gradient boosting (GB), and multilayer perceptron (MLP), yielding 9 base models. The 4 best-performing base models were selected and fused with Stacking ensemble learning, the fused model was then trained a second time with each of the same 9 algorithms, and the results were evaluated by accuracy, micro-average receiver operating characteristic (micro-average ROC) curves, area under the curve (AUC), and confusion matrices to select the optimal diagnostic model. Results: Data processing yielded 79 stable features and 13 TCM syndromes. In base model training, the RF, ExtraTrees, MLP, and SVM base models showed the best overall performance, so the predicted syndrome distributions of these 4 models were used as the second-level training data, and 9 fusion models (SVM, KNN, RF, ExtraTrees, XGBoost, LightGBM, GB, AdaBoost, MLP) were obtained with Stacking ensemble learning. Among them, the XGBoost fusion model performed best, with accuracies of 0.850 on the training set and 0.838 on the test set, an overfitting gap of 0.012, and an area under the micro-average ROC curve (micro-average AUC) of 0.996. All fusion models improved on the base models in test-set accuracy and micro-average AUC. Conclusion: Taking TCM syndrome data of lung cancer as an example, the XGBoost fusion model obtained through Stacking ensemble learning showed a significant advantage in improving the diagnostic performance for TCM syndromes of lung cancer. Stacking ensemble learning can thus integrate the strengths of multiple model algorithms and effectively improve TCM syndrome diagnostic…
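The evaluation measures named in the abstract (accuracy, the train/test overfitting gap, the micro-average AUC, and the confusion matrix) can be illustrated with a small standard-library sketch. This is not the authors' code. The three syndrome labels `A`, `B`, `C` and the probability rows are invented toy values, and reading the "overfitting gap" as training accuracy minus test accuracy is an assumption that happens to agree with the reported 0.850, 0.838, and 0.012.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def micro_average_auc(y_true, proba, labels):
    """Micro-average one-vs-rest AUC.

    Every (sample, class) pair is treated as one binary decision: the target
    is 1 if the class is the sample's true label, and the score is the
    predicted probability of that class. The AUC is the probability that a
    random positive pair outscores a random negative pair (ties count 0.5).
    """
    pairs = [(row[j], 1 if label == t else 0)
             for t, row in zip(y_true, proba)
             for j, label in enumerate(labels)]
    pos = [s for s, y in pairs if y == 1]
    neg = [s for s, y in pairs if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 3 syndrome labels, 4 samples (not from the study).
labels = ["A", "B", "C"]
y_true = ["A", "B", "C", "A"]
proba = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.5, 0.3],  # true label is C, but B gets the highest probability
    [0.6, 0.3, 0.1],
]

# Predicted label = class with the highest predicted probability.
y_pred = [labels[max(range(len(labels)), key=row.__getitem__)] for row in proba]

acc = accuracy(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred, labels)
auc = micro_average_auc(y_true, proba, labels)

# Assumed definition: overfitting gap = training accuracy - test accuracy,
# applied here to the two accuracies reported in the abstract.
gap = round(0.850 - 0.838, 3)

print(acc, auc, gap)
print(cm)
```

The pairwise AUC above is quadratic in the number of (sample, class) pairs, which is fine for an illustration. For a dataset of the size described, a rank-based implementation from an established library would be the more practical choice.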

Keywords: TCM syndrome diagnostic model; lung cancer; syndrome; machine learning; Stacking ensemble learning

Classification codes: R273 [Medicine and Health: Integrated Chinese and Western Medicine]; TP181 [Medicine and Health: TCM Oncology]

 
