A Study of the Effect of Sampling Ratio on Classification Results in Imbalanced Datasets (不平衡数据集中采样比例对分类结果影响的研究)

A study on optimizing sampling ratios for improved classification results in imbalanced datasets

Authors: XU Siwei; ZHOU Ming [1]; ZOU Rui; LIU Jihua [1]; WU Junping; QIN Yulu (School of Business, Hubei University, Wuhan 430062, China)

Affiliation: [1] School of Business, Hubei University, Wuhan 430062, China

Source: Intelligent Computer and Applications (《智能计算机与应用》), 2024, No. 9, pp. 111-117 (7 pages)

Abstract: Progress in many fields generates large volumes of data from different classes, and the class distribution of the resulting datasets is often imbalanced, particularly in medicine, finance, and industry. Previous research has focused mainly on sampling methods and classification algorithms. To address the classification of imbalanced datasets, this study draws a validation set at the original class ratio, builds training sets from the remaining data using different sampling ratios and resampling techniques, and applies several classification algorithms to examine how the sampling ratio affects classification results. The experiments show that minority-class precision is better when the sampling ratio is close to the original ratio, that minority-class recall is better when the sampling ratio is close to balanced, and that the best F-score occurs between the original and balanced ratios. These findings offer guidance for different application needs: use the original data when high minority-class precision is required, and resample to balance the classes when high minority-class recall is required.
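The experimental setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the 20% validation fraction, the definition of the sampling ratio as majority:minority, and the use of random undersampling as the resampling technique are all assumptions made for the example. The abstract does not say which resampling techniques or classifiers were used.

```python
import random


def stratified_holdout(labels, frac, rng):
    """Split indices so the validation part keeps the original class ratio."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, valid = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        k = round(len(idx) * frac)  # same fraction taken from every class
        valid.extend(idx[:k])
        train.extend(idx[k:])
    return train, valid


def undersample_to_ratio(train_idx, labels, minority, ratio, rng):
    """Keep all minority samples and about `ratio` majority samples per minority sample.

    Assumption: "sampling ratio" means majority:minority; random undersampling
    stands in for whatever resampling techniques the paper actually used.
    """
    mino = [i for i in train_idx if labels[i] == minority]
    majo = [i for i in train_idx if labels[i] != minority]
    keep = min(len(majo), round(len(mino) * ratio))
    return mino + rng.sample(majo, keep)


def minority_metrics(y_true, y_pred, minority):
    """Precision, recall, and F1 computed for the minority class only."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == minority and p == minority for t, p in pairs)
    fp = sum(t != minority and p == minority for t, p in pairs)
    fn = sum(t == minority and p != minority for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical data: 100 minority (label 1) and 900 majority (label 0) samples.
labels = [1] * 100 + [0] * 900
rng = random.Random(0)
train_idx, valid_idx = stratified_holdout(labels, 0.2, rng)

# Sweep from the original ratio (9:1) towards the balanced ratio (1:1).
for ratio in (9, 5, 3, 1):
    subset = undersample_to_ratio(train_idx, labels, 1, ratio, rng)
    print(ratio, len(subset))
```

In a full experiment, a classifier would be trained on each resampled subset and scored with `minority_metrics` on the untouched validation indices, so that every sampling ratio is evaluated against the same original class distribution.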

Keywords: resampling; imbalanced dataset; sampling ratio; recall; precision

Classification number: TP311.13 [Automation and Computer Technology: Computer Software and Theory]

 
