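Affiliations: [1] School of Public Health, Southwest Medical University, Luzhou 646000 [2] Department of Science and Technology, Southwest Medical University, Luzhou 646000
Source: Journal of Southwest Medical University, 2023, No. 4, pp. 330-335 (6 pages)
Funding: National Statistical Science Research Project (2021LZ31); Luzhou Municipal Government-Southwest Medical University School of Public Health Innovation Team Project (SPH18001).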
Abstract:
Objective: To construct early-diagnosis prediction models for prostate cancer using three methods, logistic regression, decision tree, and the Lagrangian Support Vector Machine (LSVM), and to compare their predictive performance in order to provide theoretical support for the early diagnosis of prostate cancer.

Methods: Data came from the Prostate Tumor Early Warning Dataset of the National Clinical Medical Science Data Center (301 Hospital). After cleaning, the data were randomly divided into a training set and a test set at a ratio of 7:3. On the training set, univariate unconditional logistic regression was used to screen factors associated with prostate cancer, and three early-diagnosis prediction models were then built: multivariate unconditional logistic regression, LSVM, and random forest. The test set was used to verify the predictive accuracy of the three models, and ROC curves were used to evaluate and compare them.

Results: Univariate logistic analysis identified 13 statistically significant indicators: age, creatine kinase isoenzyme, triglyceride, phospholipid, free PSA, total PSA, calcium, serum uric acid, apolipoprotein A1, apolipoprotein B, apolipoprotein C2, apolipoprotein C3, and apolipoprotein E. Multivariate logistic analysis retained four statistically significant variables: age, creatine kinase isoenzyme, free PSA, and total PSA. The LSVM model selected 10 predictors, in descending order of importance: total PSA, age, apolipoprotein A1, phospholipid, apolipoprotein B, triglyceride, serum uric acid, free PSA, creatine phosphate isoenzyme, and apolipoprotein E. The random forest model selected 10 predictors, in descending order of importance: apolipoprotein C3, phospholipid, free PSA, apolipoprotein B, apolipoprotein E, calcium, serum uric acid, apolipoprotein A1, apolipoprotein C2, and creatine kinase isoenzyme. The areas under the ROC curve (AUC) of the multivariate unconditional logistic regression, LSVM, and random forest models were 0.895 (0.876, 0.913), 0.918 (0.902, 0.934), and 0.724 (0.688, 0.760), respectively.

Conclusion: The LSVM model had the best predictive performance, the multivariate logistic regression model performed acceptably, and the random forest model performed poorly.
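The evaluation procedure described in the abstract (a random 7:3 split followed by a comparison of AUC values) can be illustrated with a minimal sketch. This is not the authors' code: the function names `split_70_30` and `roc_auc`, the fixed seed, and the toy labels and scores are all assumptions made for illustration, and the abstract does not say which software was used or how the intervals reported beside each AUC were computed.

```python
import random


def split_70_30(records, seed=0):
    """Shuffle a copy of `records` and return (train, test) at a 7:3 ratio.

    The seed is a hypothetical choice; the abstract only says the split
    was random.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * 0.7)
    return shuffled[:cut], shuffled[cut:]


def roc_auc(labels, scores):
    """Area under the ROC curve via its rank interpretation.

    AUC equals the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case,
    with ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data, not taken from the study.
train, test = split_70_30(range(10))
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In practice each fitted model would produce one score per test-set patient, and `roc_auc` would be called once per model on the same test labels so that the resulting values are directly comparable.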