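Authors: Li Jingwei (李婧惟); Liu Yan (刘艳)[1]; Lu Zhen (陆震) (Department of Health Statistics, Harbin Medical University, Harbin 150081, China)
Affiliation: [1] Department of Health Statistics, Harbin Medical University, Harbin 150081, Heilongjiang, China
Source: Chinese Journal of Hospital Statistics (《中国医院统计》), 2021, No. 1, pp. 91-96 (6 pages)
Funding: Natural Science Foundation of Heilongjiang Province (LH2019H005); Harbin Medical University Graduate Research and Practice Innovation Fund (YJSKYCX2019-74HYD).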
Abstract:
Objective: To investigate whether class imbalance poses additional challenges for class prediction of gene expression data, and to evaluate the performance of seven classifiers on data with different class balance ratios using publicly available data sets, in order to provide a theoretical basis for subsequent research.
Methods: Samples were repeatedly drawn at random from real public data sets at different ratios to form training sets (negative sample size Nn=10 with positive sample size Np=10, 15, 20, 30, 35; Nn=15 with Np=5, 10, 15, 25, 30) and balanced test sets (Nn=20, Np=20), giving 10 new data sets. Seven commonly used classification algorithms (SVM, C4.5, NB, RF, KNN, AdaBoost, Bagging) were applied to these 10 data sets, and the classification performance of a single sampling was compared with the average over 100 samplings.
Results: As the number of positive samples in the data set increased, the overall sensitivity of the classification algorithms showed an upward trend, while specificity showed a downward trend. In the colon cancer data set, AdaBoost, NB, and RF performed well, while SVM performed poorly and was unstable. In the leukemia data set, NB performed best overall and was stable; AdaBoost, C4.5, and RF classified well but fluctuated considerably.
Conclusion: In gene expression data sets, the class balance ratio, the characteristics of the data, and the choice of classification algorithm all affect class prediction. Results from a single analysis are subject to chance and reproduce poorly. Researchers analyzing class-imbalanced data should therefore choose an algorithm carefully in light of the class distribution, and should always use an appropriate method for dealing with the class imbalance problem.
Classification code: R195.1 [Medicine and Health: Health Statistics]
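The sampling design in the Methods can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data are synthetic Gaussian features, the test set is assumed to be disjoint from the training set (the abstract does not say), and a simple nearest-centroid rule stands in for the seven classifiers, none of which is implemented here. All function names (`draw_split`, `sensitivity_specificity`, `nearest_centroid_predict`, `evaluate`) are hypothetical.

```python
import random
from statistics import mean


def draw_split(neg, pos, n_neg_train, n_pos_train, n_test_per_class, rng):
    """Draw a training set with the requested class sizes and a balanced test set.

    Assumption: training and test samples do not overlap.
    """
    neg_draw = rng.sample(neg, n_neg_train + n_test_per_class)
    pos_draw = rng.sample(pos, n_pos_train + n_test_per_class)
    train = [(x, 0) for x in neg_draw[:n_neg_train]]
    train += [(x, 1) for x in pos_draw[:n_pos_train]]
    test = [(x, 0) for x in neg_draw[n_neg_train:]]
    test += [(x, 1) for x in pos_draw[n_pos_train:]]
    return train, test


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / positives, specificity = TN / negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return tp / n_pos, tn / n_neg


def nearest_centroid_predict(train, test_x):
    """Stand-in classifier (not one of the study's seven): nearest class centroid."""
    dims = len(train[0][0])
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        centroids[label] = [mean(r[j] for r in rows) for j in range(dims)]

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return [min((0, 1), key=lambda c: sq_dist(x, centroids[c])) for x in test_x]


def evaluate(neg, pos, n_neg_train, n_pos_train, n_test=20, repeats=100, seed=0):
    """Average sensitivity and specificity over repeated random samplings."""
    rng = random.Random(seed)
    sens, spec = [], []
    for _ in range(repeats):
        train, test = draw_split(neg, pos, n_neg_train, n_pos_train, n_test, rng)
        y_pred = nearest_centroid_predict(train, [x for x, _ in test])
        se, sp = sensitivity_specificity([y for _, y in test], y_pred)
        sens.append(se)
        spec.append(sp)
    return mean(sens), mean(spec)


# Synthetic stand-in data (assumption): 40 negatives, 60 positives, 5 features.
data_rng = random.Random(1)
neg = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(40)]
pos = [[data_rng.gauss(1.0, 1.0) for _ in range(5)] for _ in range(60)]

for n_pos_train in (10, 15, 20, 30, 35):  # the Nn=10 series from the Methods
    se, sp = evaluate(neg, pos, n_neg_train=10, n_pos_train=n_pos_train)
    print(f"Nn=10 Np={n_pos_train}: sensitivity={se:.3f} specificity={sp:.3f}")
```

Setting `repeats=1` gives the single-sampling case, so the same function can be used to compare one draw against the 100-draw average. The trends reported in the Results concern the study's real data and classifiers; this synthetic sketch is not expected to reproduce them.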