Adaptive Oversampling Method Based on Maximum Safe Nearest Neighbor and Local Density


Authors: ZHAO Xiaoqiang [1]; HE Jiaqi (College of Electrical Engineering and Information Engineering, Lanzhou University of Technology, Lanzhou 730000, China)

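Affiliation: [1] College of Electrical Engineering and Information Engineering, Lanzhou University of Technology, Lanzhou 730000, China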

Source: Journal of Electronics & Information Technology (《电子与信息学报》), 2025, No. 4, pp. 1140-1149 (10 pages)

Funding: National Natural Science Foundation of China (62263021); Gansu Provincial Higher Education Industry Support Program (2023CYZC-24).

Abstract: To address the problem of how to synthesize effective new samples when oversampling imbalanced data, this paper proposes an adaptive oversampling method based on the maximum safe nearest neighbor and local density. The method uses the maximum safe nearest neighbor and the local density to divide minority class samples into safe samples, boundary samples, and outliers. On this basis, the sampling probability of each sample is set by a combined weighting, so that "sub-boundary samples" close to the boundary are more likely to be selected as root samples, and the parameter K of the K-nearest neighbors is adjusted adaptively to select the best synthesis region. For outliers, a random oversampling strategy within a hypersphere is used to further increase the diversity of the minority class samples. Finally, the proposed method is compared with six oversampling methods, including the Synthetic Minority Oversampling Technique (SMOTE) and the Adaptive Synthetic Sampling method (ADASYN), on 13 public datasets. Relative to the six comparison methods, the proposed method improves the F1-score by an average of 6.9%, 8.8%, 8.2%, 5.8%, 7.2%, and 12.5%, respectively, and the geometric mean (G-mean) by an average of 3.0%, 2.5%, 3.0%, 3.2%, 5.3%, and 8.6%, respectively, which shows that the proposed method can effectively address the classification of imbalanced data.

Objective: Traditional classifiers tend to optimize overall accuracy when dealing with imbalanced data sets, often resulting in poor classification performance for minority class samples. Among the available strategies, oversampling methods are widely used due to their strong generalization ability. However, conventional oversampling techniques frequently generate new samples with high overlap rates and limited validity, particularly near decision boundaries. To address this issue, this study proposes an adaptive oversampling approach that selects sub-boundary samples (those located near the boundary samples) for sample generation. In addition, the nearest-neighbor parameter space is constrained to refine the synthetic sample region. This method improves the classifier's performance when learning from imbalanced data sets.

Methods: This study first identifies the maximum safe like-neighbors of positive class samples and classifies these samples as either hazardous or safe. The local density of each sample is then calculated, and hazardous samples (those more difficult to classify) are further categorized as either boundary samples or outliers. To provide the classifier with more informative positive class samples, "sub-boundary points" are preferentially selected as root samples using a weighted composite factor. The K-value in the K-nearest neighbor algorithm is adaptively adjusted based on the maximum safe nearest neighbor of each sample to improve neighbor selection. Outliers are oversampled randomly within a hypersphere to generate new samples while minimizing increases in spatial complexity.

Results and Discussions: To evaluate the feasibility and generalization of the proposed method, Logistic Regression (LR) and Support Vector Machine (SVM) classifiers are employed as base classifiers. The range of the distance adjustment coefficient is first determined by comparing results across selected datasets (Table 3). Once the range is established, the effect of different weight adjustment coefficients on performance is assessed (Table 4
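The random oversampling of outliers within a hypersphere, as described in the Methods summary, can be illustrated with a minimal Python sketch. The abstract does not state how the hypersphere radius is chosen or which distribution is used inside it, so the `radius` argument, the uniform-in-volume sampling, and the function name `sample_in_hypersphere` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def sample_in_hypersphere(center, radius, n_samples, rng=None):
    """Draw points uniformly from the ball of `radius` around `center`.

    Hypothetical helper: the radius rule and the uniform distribution
    are assumptions, since the abstract does not specify them.
    """
    rng = np.random.default_rng(rng)
    center = np.asarray(center, dtype=float)
    dim = center.shape[0]

    # Random directions: Gaussian vectors normalized to unit length
    # are uniformly distributed on the unit sphere.
    directions = rng.normal(size=(n_samples, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    # Scaling by u**(1/dim) makes the points uniform in volume
    # instead of clustering near the center.
    radii = radius * rng.random(n_samples) ** (1.0 / dim)

    return center + directions * radii[:, None]


# Example: three synthetic samples around one minority-class outlier.
outlier = np.array([1.0, 2.0, 3.0])
synthetic = sample_in_hypersphere(outlier, radius=0.5, n_samples=3, rng=0)
```

Every generated point lies within `radius` of the outlier, so the new samples stay local to it; in practice the radius would have to be tied to the surrounding majority class samples to avoid class overlap, which this sketch does not attempt.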

Keywords: imbalanced data; oversampling technique; maximum safe nearest neighbor; sub-boundary samples

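Classification codes: TN911 [Electronics and Telecommunications: Communication and Information Systems]; TP274 [Electronics and Telecommunications: Information and Communication Engineering]; TP181 [Automation and Computer Technology: Detection Technology and Automation Devices]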

 
