基于样本密度峰值的不平衡数据欠抽样方法  被引量:7

Under-sampling method based on sample density peaks for imbalanced data

在线阅读下载全文

作  者:苏俊宁 叶东毅[1] SU Junning;YE Dongyi(College of Mathematics and Computer Science,Fuzhou University,Fuzhou Fujian 350108,China)

机构地区:[1]福州大学数学与计算机科学学院

出  处:《计算机应用》2020年第1期83-89,共7页journal of Computer Applications

基  金:国家自然科学基金资助项目(61672158);福建省高校产学合作项目(2018H6010)~~

摘  要:不平衡数据分类是数据挖掘和机器学习领域的一个重要问题,其中数据重抽样方法是影响分类准确率的一个重要因素。针对现有不平衡数据欠抽样方法不能很好地保持抽样样本与原有样本的分布一致的问题,提出一种基于样本密度峰值的不平衡数据欠抽样方法。首先,应用密度峰值聚类算法估计多数类样本聚成的不同类簇的中心区域和边界区域,进而根据样本所处类簇区域的局部密度和不同密度峰值的分布信息计算样本权重;然后,按照权重大小对多数类样本点进行欠抽样,使所抽取的多数类样本尽可能由类簇中心区域向边界区域逐步减少,在较好地反映原始数据分布的同时又可抑制噪声;最后,将抽取到的多数类样本与所有的少数类样本构成平衡数据集用于分类器的训练。多个数据集上的实验结果表明,与现有的RBBag、uNBBag和KAcBag等欠抽样方法相比,所提方法在F1-measure和G-mean指标上均取得一定的提升,是有效、可行的样本抽样方法。Imbalanced data classification is an important problem in data mining and machine learning. The way of re-sampling of data is crucial to the accuracy of classification. Concerning the problem that the existing under-sampling methods for imbalanced data cannot keep the distribution of sampling samples in good agreement with that of original samples, an under-sampling method based on sample density peaks was proposed. Firstly, the density peak clustering algorithm was applied to cluster samples of majority class and to estimate the central and boundary regions of different clusters obtained, so that each sample weight was determined according to the local density and different density peak distribution of cluster region where the sample was in. Then, the samples of majority class were under-sampled based on weights, so that the population of extracted majority class samples was gradually reduced from central region to boundary region of its cluster. In this way, the extracted samples would well reflect original sample distribution while suppressing the noise. Finally, a balanced data set was constructed by the sampled majority samples and all minority samples for the classifier training. The experimental results on multiple datasets show that the proposed sampling method has the F1-measure and G-mean improved, compared with some existing methods such as RBBag(Roughly Balanced Bagging), uNBBag(under-sampling NeighBorhood Bagging), KAcBag(K-means AdaCost bagging), proving that the proposed method is an effective and feasible sampling method.

关 键 词:不平衡数据 密度峰值 样本权重 欠抽样 集成学习 

分 类 号:TP301.6[自动化与计算机技术—计算机系统结构]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象