Label-balancing method based on sample distribution

Authors: LI Guo-he [1,2]; CHEN Gui-ting; ZHENG Yi-feng; HONG Yun-feng; ZHOU Xiao-ming; PAN Xue-ling

Affiliations: [1] Beijing Key Lab of Petroleum Data Mining, China University of Petroleum-Beijing, Beijing 102249, China; [2] College of Information Science and Engineering, China University of Petroleum-Beijing at Karamay, Karamay, Xinjiang 834000, China; [3] College of Computer Science, Minnan Normal University, Zhangzhou, Fujian 363000, China; [4] Application Research Institute, Hangzhou Shibei Intellectual Property Service Limited Company, Hangzhou, Zhejiang 310010, China; [5] Applied Research Institute, Xiamen Hanying Internet of Things, Xiamen, Fujian 361021, China

Source: Computer Engineering and Design (《计算机工程与设计》), 2023, No. 9, pp. 2626-2633 (8 pages)

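Funding: National Natural Science Foundation of China (60473125, 61701213); Scientific Research Start-up Fund of China University of Petroleum-Beijing at Karamay (RCYJ2016B-03-001); Natural Science Foundation of Fujian Province (2021J011004, 2021J011002).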

Abstract: To address the problem of sample class imbalance, a class-balancing algorithm based on sample distribution is proposed. A one-class support vector machine and the nearest-neighbor method are used to learn the majority-class samples and purify the ambiguous class boundary. A density-based clustering algorithm then clusters the minority-class samples; the number of samples to generate for each cluster is determined by that cluster's weight, which balances the sample counts across clusters. Within each cluster, the weight of each sample is determined from the ratio of boundary samples to non-boundary samples, and SMOTE is used to synthesize new minority-class samples. Comparative experiments on UCI datasets and an application to earthquake data analysis show that the algorithm improves classification accuracy across different classifiers, especially on imbalanced data.
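The last two steps of the abstract (per-cluster allocation and SMOTE synthesis) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the inverse-size cluster weight, the function names `allocate_per_cluster` and `smote_cluster`, and the optional `sample_weights` parameter are all assumptions made for illustration. Cluster membership is taken as given here; the one-class SVM cleaning and density clustering steps are not shown.

```python
import math
import random


def allocate_per_cluster(cluster_sizes, n_new):
    """Split n_new synthetic samples across minority clusters.

    Assumption: a cluster's weight is inversely proportional to its size,
    so sparse clusters receive more synthetic samples. The paper's actual
    weighting formula is not stated in the abstract.
    """
    inv = [1.0 / s for s in cluster_sizes]
    total = sum(inv)
    counts = [int(n_new * w / total) for w in inv]
    # Hand out what rounding down left over, smallest clusters first.
    remainder = max(0, n_new - sum(counts))
    order = sorted(range(len(cluster_sizes)), key=lambda i: cluster_sizes[i])
    for i in order[:remainder]:
        counts[i] += 1
    return counts


def smote_cluster(points, n_new, k=3, sample_weights=None, rng=None):
    """Synthesize n_new points inside one cluster by SMOTE interpolation.

    sample_weights (hypothetical) biases which samples are used as seeds,
    e.g. to favour boundary samples; uniform if omitted.
    """
    rng = rng or random.Random(0)
    if len(points) < 2 or n_new <= 0:
        return []
    weights = sample_weights or [1.0] * len(points)
    synthetic = []
    for _ in range(n_new):
        # Pick a seed sample; a higher weight means it is chosen more often.
        i = rng.choices(range(len(points)), weights=weights, k=1)[0]
        base = points[i]
        # Up to k nearest neighbours of the seed within the same cluster.
        others = [p for j, p in enumerate(points) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic


# Toy minority class already split into two clusters (made-up data).
clusters = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],  # 4 samples
    [(5.0, 5.0), (6.0, 5.0)],                          # 2 samples
]
rng = random.Random(42)
counts = allocate_per_cluster([len(c) for c in clusters], n_new=6)
new_points = [p for c, n in zip(clusters, counts)
              for p in smote_cluster(c, n, rng=rng)]
```

Because each synthetic point is interpolated between a seed and a neighbor from the same cluster, new samples stay inside the cluster that produced them, which is the motivation for clustering before oversampling.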

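Keywords: imbalanced data; oversampling; one-class support vector machine; density clustering; sample class balancing; sample distribution; classification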

Classification code: TP306.1 [Automation and Computer Technology: Computer System Architecture]

 
