Classification sample optimization based on kurtosis


Authors: WANG Shengjing; CHU Haoran; YUAN Yongsheng[1] (College of Science, Hohai University, Nanjing 211100, China)

Affiliation: [1] College of Science, Hohai University, Nanjing 211100, Jiangsu, China

Source: Modern Electronics Technique (《现代电子技术》), 2023, No. 13, pp. 121-127 (7 pages)

Funding: National Natural Science Foundation of China (11201116).

Abstract: In machine learning, outliers in classification data behave markedly differently from the main body of their class cluster on certain features. This weakens the between-class discriminability of those features and degrades classification performance. Much existing work concentrates on improving the accuracy of outlier detection and overlooks the negative effect that outliers have in blurring the characteristics of different class clusters. This paper proposes removing outliers from the classification samples in order to improve classification accuracy. Based on the statistical relationship between how outlying an instance is within its class and the distribution of similarities between instances, and exploiting the sensitivity of kurtosis to deviations, a kurtosis outlier factor (KOF) is constructed to measure the degree to which a sample is an outlier. The KOF value of every instance in the data set is computed, the abrupt change point is located from the gradient of the KOF values, and outlying instances are identified and removed with the help of the 3σ rule, which yields an optimized classification data set. Comparative experiments before and after optimization are carried out with three classical classifiers (K-nearest neighbor, support vector machine, and random forest) on 15 data sets, including classical UCI data sets, power load data sets, and point cloud data sets. The results show that the proposed strategy effectively improves classification performance and also reduces the amount of computation.
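The abstract does not give the KOF formula, so the following is only a rough sketch of the general idea (score each instance of one class with a kurtosis-based quantity, look for the largest jump in the sorted scores, and apply a 3σ cut), not the authors' definition. The score used here, the drop in kurtosis of the distance-to-centroid distribution when one instance is left out, is an assumption made for illustration, as are all function names and the toy data.

```python
import math
from statistics import mean, pstdev


def kurtosis(values):
    """Non-excess kurtosis: fourth central moment over squared variance."""
    m = mean(values)
    var = mean([(v - m) ** 2 for v in values])
    if var == 0:
        return 0.0  # degenerate case: all values identical
    return mean([(v - m) ** 4 for v in values]) / var ** 2


def outlier_scores(points):
    """Illustrative score (NOT the paper's KOF): how much the kurtosis of the
    distance-to-centroid distribution drops when instance i is left out."""
    dim = len(points[0])
    centroid = [mean([p[d] for p in points]) for d in range(dim)]
    dists = [math.dist(p, centroid) for p in points]
    base = kurtosis(dists)
    return [base - kurtosis(dists[:i] + dists[i + 1:]) for i in range(len(dists))]


def after_largest_jump(scores):
    """Indices of the instances lying above the largest gap in the sorted scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    gaps = [scores[order[k + 1]] - scores[order[k]] for k in range(len(order) - 1)]
    cut = max(range(len(gaps)), key=lambda k: gaps[k])
    return set(order[cut + 1:])


def three_sigma(scores):
    """Indices whose score exceeds the mean score by more than 3 standard deviations."""
    mu, sigma = mean(scores), pstdev(scores)
    return {i for i, s in enumerate(scores) if s > mu + 3 * sigma}


def flag_outliers(points):
    """Flag instances that pass both the jump test and the 3-sigma test."""
    scores = outlier_scores(points)
    return sorted(after_largest_jump(scores) & three_sigma(scores))


# Toy data for a single class: a tight 6x5 grid plus one distant point.
cluster = [(0.1 * x, 0.1 * y) for x in range(6) for y in range(5)]
one_class = cluster + [(10.0, 10.0)]
flagged = flag_outliers(one_class)
cleaned = [p for i, p in enumerate(one_class) if i not in flagged]
```

In a full pipeline this would be run separately for each class label, and the cleaned samples would then be passed to a classifier such as K-nearest neighbor. Requiring both tests to agree is a conservative choice made for this sketch; how the paper combines the gradient change with the 3σ rule is not stated in the abstract.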

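Keywords: kurtosis index; sample optimization; outliers; intra-class sample similarity; gradient change; multi-class classification; supervised learning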

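Classification codes: TN919-34 [Electronics and Telecommunications: Communication and Information Systems]; TP301 [Electronics and Telecommunications: Information and Communication Engineering]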

 
