A borderline sample synthesis oversampling method based on KNN and random affine transformation


Authors: LENG Qiangkui; SUN Xuezi; MENG Xiangfu [1]

Affiliation: [1] School of Electronic and Information Engineering, Liaoning Technical University, Huludao 125105, Liaoning, China

Source: CAAI Transactions on Intelligent Systems, 2025, No. 2, pp. 329-343 (15 pages)

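Funding: Youth Program of the National Natural Science Foundation of China (61602056); General Program of the National Natural Science Foundation of China (61772249); Project of the Education Department of Liaoning Province (JYTMS20230819); Doctoral Research Start-up Fund of Liaoning Technical University (21-1043).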

Abstract: Oversampling is an effective strategy for handling imbalanced data classification. This paper proposes a borderline sample synthesis oversampling method based on K-nearest neighbor (KNN) and random affine transformation, which improves both the seed sample selection stage and the synthetic sample generation stage of existing oversampling methods. First, the three nearest neighbor theory is introduced to establish an effective intrinsic neighborhood relationship between samples and to remove noise from the dataset, which reduces the overfitting risk of subsequent classifiers. Next, the minority-class borderline samples that are difficult to learn but rich in information are accurately identified and used as sampling seeds. Finally, local random affine transformation replaces the linear interpolation mechanism, generating synthetic samples uniformly within the approximate manifold of the original data. Compared with traditional oversampling methods, the proposed method exploits the important borderline information in a dataset more fully, giving the classifier more support and improving its classification performance. Extensive comparative experiments on 18 benchmark datasets compared the proposed method with 8 classic sampling methods, each combined with 4 different classifiers. The results show that the proposed method achieves higher F1 scores and geometric means (G-mean) and addresses the imbalanced data classification problem more effectively. In addition, statistical analysis confirms that the method attains a higher Friedman ranking.
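The three stages named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the paper's actual rules, so everything specific here is an assumption: the function names `knn_indices` and `oversample_sketch`, the parameters `k` and `m`, the noise rule (drop a sample when all of its `k` neighbours carry another label), the borderline rule (at least half of the neighbours carry another label), and the use of Dirichlet weights to draw a random affine (here convex) combination of a seed's nearest minority samples. In particular, a plain KNN disagreement rule stands in for the three nearest neighbor theory, which is not reproduced here.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbours of each row of X (self excluded)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1)[:, :k]

def oversample_sketch(X, y, minority=1, k=5, m=3, n_new=50, seed=0):
    """Hypothetical three-stage sketch; not the authors' implementation."""
    rng = np.random.default_rng(seed)

    # Stage 1 (assumed rule): treat a sample as noise when every one of
    # its k nearest neighbours has a different label, and drop it.
    other = y[knn_indices(X, k)] != y[:, None]
    keep = other.mean(axis=1) < 1.0
    X, y = X[keep], y[keep]

    # Stage 2 (assumed rule): a minority sample is "borderline" when at
    # least half of its k neighbours belong to another class.
    other = y[knn_indices(X, k)] != y[:, None]
    is_min = y == minority
    X_min = X[is_min]
    seeds = X[is_min & (other.mean(axis=1) >= 0.5)]
    if len(seeds) == 0:  # fall back to all minority samples
        seeds = X_min

    # Stage 3 (assumed mechanism): combine the seed and its nearest
    # minority samples with random weights that sum to 1. Dirichlet(1,...,1)
    # weights are uniform over the simplex of non-negative weights.
    m = min(m, len(X_min))
    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        s = seeds[rng.integers(len(seeds))]
        near = np.argsort(np.linalg.norm(X_min - s, axis=1))[:m]
        w = rng.dirichlet(np.ones(m))
        synthetic[i] = w @ X_min[near]
    return X, y, synthetic
```

Unlike linear interpolation, which places each new point on the segment between two samples, combining `m` samples with random weights fills the local simplex they span. That is one plausible reading of generating samples within the approximate manifold, though the paper's transformation may differ.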

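Keywords: K-nearest neighbor; linear interpolation; borderline samples; natural distribution; oversampling; three nearest neighbor theory; random affine transformation; imbalanced classification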

CLC number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
