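Authors: ZHAO Xiaoqiang (赵小强)[1]; HE Jiaqi (何嘉琦) (College of Electrical Engineering and Information Engineering, Lanzhou University of Technology, Lanzhou 730000, China)
Affiliation: [1] College of Electrical Engineering and Information Engineering, Lanzhou University of Technology, Lanzhou 730000, China
Source: Journal of Electronics & Information Technology (《电子与信息学报》), 2025, No. 4, pp. 1140-1149 (10 pages)
Funding: National Natural Science Foundation of China (62263021); Gansu Provincial University Industry Support Program (2023CYZC-24).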
Abstract: To address the question of how to synthesize effective new samples when oversampling imbalanced data, this paper proposes an adaptive oversampling method based on the maximum safe nearest neighbor and local density. The method uses the maximum safe nearest neighbor and local density to divide minority class samples into safe samples, boundary samples, and outliers. On this basis, each sample's sampling probability is set by a combined weighting, so that "sub-boundary samples" close to the boundary are more likely to be chosen as root samples, and the parameter K of the K-nearest neighbor search is adjusted adaptively to select the optimal synthesis region. Outliers are handled by random oversampling within a hypersphere, which further increases the diversity of the minority class samples. Finally, the proposed method is compared with six oversampling methods, including the Synthetic Minority Oversampling Technique (SMOTE) and the Adaptive Synthetic Sampling approach (ADASYN), on 13 public datasets. Relative to the comparison methods, the proposed method improves the F1-score by an average of 6.9%, 8.8%, 8.2%, 5.8%, 7.2%, and 12.5%, respectively, and the geometric mean (G-mean) by an average of 3.0%, 2.5%, 3.0%, 3.2%, 5.3%, and 8.6%, respectively, showing that it can effectively address the imbalanced data classification problem.

Objective: Traditional classifiers tend to optimize overall accuracy when dealing with imbalanced data sets, often resulting in poor classification performance for minority class samples. Among the available strategies, oversampling methods are widely used due to their strong generalization ability. However, conventional oversampling techniques frequently generate new samples with high overlap rates and limited validity, particularly near decision boundaries. To address this issue, this study proposes an adaptive oversampling approach that selects sub-boundary samples (those located near the boundary samples) for sample generation. In addition, the nearest-neighbor parameter space is constrained to refine the synthetic sample region. This method improves the classifier's performance when learning from imbalanced data sets.

Methods: This study first identifies the maximum safe like-neighbors of positive class samples and classifies these samples as either hazardous or safe. The local density of each sample is then calculated, and hazardous samples (those more difficult to classify) are further categorized as either boundary samples or outliers. To provide the classifier with more informative positive class samples, "sub-boundary points" are preferentially selected as root samples using a weighted composite factor. The K-value in the K-nearest neighbor algorithm is adaptively adjusted based on the maximum safe nearest neighbor of each sample to improve neighbor selection. Outliers are oversampled randomly within a hypersphere to generate new samples while minimizing increases in spatial complexity.

Results and Discussions: To evaluate the feasibility and generalization of the proposed method, Logistic Regression (LR) and Support Vector Machine (SVM) classifiers are employed as base classifiers. The range of the distance adjustment coefficient is first determined by comparing results across selected datasets (Table 3). Once the range is established, the effect of different weight adjustment coefficients on performance is assessed (Table 4
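Keywords: imbalanced data; oversampling; maximum safe nearest neighbor; sub-boundary samples
Classification codes: TN911 [Electronics and Telecommunications: Communication and Information Systems]; TP274 [Electronics and Telecommunications: Information and Communication Engineering]; TP181 [Automation and Computer Technology: Detection Technology and Automation Devices]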
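
Note: The abstract names three generation steps without giving their formulas: weighted selection of root samples, synthesis between a root and one of its neighbors, and random oversampling of outliers within a hypersphere. The Python sketch below illustrates generic versions of these steps, not the authors' implementation. The function names, the weights, the radius, and the use of SMOTE-style linear interpolation for the synthesis step are all assumptions made for illustration. The paper's composite weighting factor and adaptive choice of K are not reproduced here.

```python
import numpy as np


def choose_roots(weights, n_roots, rng):
    """Pick root-sample indices with probability proportional to `weights`.

    `weights` stands in for the paper's composite weighting factor,
    whose definition is not given in the abstract (assumption).
    """
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()
    return rng.choice(len(probs), size=n_roots, p=probs)


def interpolate(root, neighbor, rng):
    """SMOTE-style synthesis: a random point on the segment root -> neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return root + gap * (neighbor - root)


def sample_in_hypersphere(center, radius, n_samples, rng):
    """Draw points uniformly from the ball of `radius` around `center`.

    How the radius is chosen for each outlier is not stated in the
    abstract, so it is left as a caller-supplied parameter (assumption).
    """
    dim = center.shape[0]
    # Normalised Gaussian vectors give uniformly distributed directions.
    directions = rng.normal(size=(n_samples, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Scaling by U^(1/dim) makes the points uniform in volume.
    radii = radius * rng.random(n_samples) ** (1.0 / dim)
    return center + directions * radii[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    minority = rng.normal(size=(20, 2))   # toy minority-class samples
    weights = rng.random(20)              # hypothetical sampling weights
    roots = choose_roots(weights, 5, rng)
    synthetic = [interpolate(minority[i], minority[(i + 1) % 20], rng)
                 for i in roots]
    around_outlier = sample_in_hypersphere(minority[0], 0.5, 3, rng)
    print(len(synthetic), around_outlier.shape)
```

In this sketch the neighbor used for interpolation is simply the next sample in the array. A faithful version would draw it from the root's K nearest minority-class neighbors, with K set per sample as the abstract describes.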