A Sample Class-Balancing Method for Classifier Learning  (Cited by: 1)

SYNTHETIC METHOD OF LABEL-BALANCING SAMPLES FOR CLASSIFIER LEARNING

Authors: Li Guohe[1,2]; Liu Shunxin; Zhang Yujie; Zheng Yifeng; Hong Yunfeng; Zhou Xiaoming

Affiliations: [1] Beijing Key Lab of Petroleum Data Mining, China University of Petroleum-Beijing, Beijing 102249, China; [2] College of Information Science and Engineering, China University of Petroleum-Beijing, Beijing 102249, China; [3] Kela Oil and Gas Development Department, Tarim Oilfield, Kuerle (Korla) 841000, Xinjiang, China; [4] China Anti-Infringement and Anti-Counterfeit Innovation Strategic Alliance, Hangzhou 310010, Zhejiang, China; [5] Xiamen Hanying Internet of Things Application Research Institute, Xiamen 361021, Fujian, China

Source: Computer Applications and Software (《计算机应用与软件》), 2022, No. 10, pp. 230-237 (8 pages)

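Funding: National Natural Science Foundation of China (60473125); Research Start-up Fund of China University of Petroleum (Beijing), Karamay Campus (RCYJ2016B-03-001); Natural Science Foundation of Fujian Province (2018J01546, 2019J01748).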

Abstract: Over-sampling is one of the effective ways to handle class-imbalanced data in classifier learning. However, existing over-sampling methods tend to generate highly similar samples, which can lead to over-fitting. To address this problem, this paper proposes GJ-RSMOTE, an over-sampling method based on the Gaussian mixture model and the Jensen-Shannon divergence. The method clusters the minority-class samples with a Gaussian mixture model and computes the number of samples to generate in each cluster from the sparsity of that cluster. To avoid over-fitting, it uses hypersphere interpolation to widen the region from which samples are generated. The Jensen-Shannon divergence is used to control the final number of generated samples. Experimental results show that GJ-RSMOTE balances the class distribution of the samples and effectively improves the recognition accuracy of the classification model.
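The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define cluster sparsity, the hypersphere interpolation rule, or how the Jensen-Shannon divergence sets the final sample count, so every choice below is an assumption made for illustration. Clusters are assumed to be given already (for example, from a Gaussian mixture model), sparsity is assumed to be the mean pairwise distance within a cluster, and a synthetic sample is assumed to be drawn uniformly from a ball centred on one minority sample with radius equal to its distance to another sample in the same cluster. The function names are hypothetical.

```python
import math
import random


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) of two discrete distributions."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing to the Kullback-Leibler sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def allocate_by_sparsity(clusters, total):
    """Split `total` synthetic samples across clusters in proportion to sparsity.

    Assumption: sparsity is the mean pairwise distance inside a cluster.
    Each cluster needs at least two points; rounding may make the counts
    sum to slightly more or less than `total`.
    """
    def sparsity(points):
        pairs = [(a, b) for i, a in enumerate(points) for b in points[i + 1:]]
        return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

    scores = [sparsity(points) for points in clusters]
    return [round(total * s / sum(scores)) for s in scores]


def sample_in_hypersphere(center, radius, rng):
    """Draw one point uniformly from the ball of `radius` around `center`."""
    direction = [rng.gauss(0, 1) for _ in center]
    norm = math.sqrt(sum(d * d for d in direction))
    # Scaling by u ** (1 / dim) makes the draw uniform over the ball's volume.
    r = radius * rng.random() ** (1 / len(center))
    return [c + r * d / norm for c, d in zip(center, direction)]


def oversample(clusters, total, seed=0):
    """Generate synthetic minority samples cluster by cluster (assumed rule)."""
    rng = random.Random(seed)
    synthetic = []
    for points, n in zip(clusters, allocate_by_sparsity(clusters, total)):
        for _ in range(n):
            center, other = rng.sample(points, 2)
            radius = math.dist(center, other)
            synthetic.append(sample_in_hypersphere(center, radius, rng))
    return synthetic
```

For example, `oversample([[(0, 0), (1, 0)], [(0, 0), (3, 0)]], total=8)` assigns more synthetic samples to the second, sparser cluster. The `js_divergence` helper only shows how the divergence itself is computed; one plausible use, again an assumption, is to compare a histogram of the generated samples with a histogram of the original minority samples and stop generating once the divergence crosses a chosen threshold.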

Keywords: imbalanced data; over-sampling; Gaussian mixture model; Jensen-Shannon divergence

Classification code: TP301.6 [Automation and Computer Technology: Computer Systems Architecture]

 
