Unbalanced data classification method based on statistical information clustering boundary  Cited by: 4


Authors: LI Xin; YU Wei-qin

Affiliation: [1] College of Mathematics and Statistics, Shanghai University of Engineering Science, Shanghai 201620, China

Source: Computer Engineering and Design, 2021, No. 8, pp. 2218-2223 (6 pages)

Funding: National Natural Science Foundation of China (11602134, 11772148); National Statistical Science Research Project, General Program (2018LY16).

Abstract: Traditional methods for handling unbalanced data are prone to overfitting and underfitting. To address this, an unbalanced data classification method based on the statistical information clustering boundary is proposed. Noise points are first removed from the data. A neighborhood radius is then set from each data object's k-distance, and the k-distance statistics within that neighborhood are used to separate boundary points from non-boundary points. The boundary points of the minority class serve as seed samples for oversampling with the SMOTE algorithm, while distance-based undersampling removes majority-class points far from the boundary, yielding a balanced data set. Experimental comparisons show that the method improves both the G-mean and the F-value.
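The sampling pipeline summarized in the abstract can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not give the exact boundary criterion, so the rule used here (a point counts as a boundary point when its k-distance exceeds the mean k-distance of the points inside its own k-distance radius) is an assumption, as are running boundary detection separately per class, balancing both classes to the midpoint of their sizes, and skipping the noise-removal step. All function names are hypothetical.

```python
import math
import random


def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour (itself excluded)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result


def boundary_flags(points, k):
    """Assumed rule: a point is a boundary point when its k-distance exceeds
    the mean k-distance of the points inside its own k-distance radius."""
    kd = k_distances(points, k)
    flags = []
    for i, p in enumerate(points):
        near = [kd[j] for j, q in enumerate(points)
                if j != i and math.dist(p, q) <= kd[i]]
        flags.append(kd[i] > sum(near) / len(near))
    return flags


def smote(seeds, pool, n_new, k, rng):
    """SMOTE-style interpolation: each synthetic point lies on the segment
    between a seed and one of its k nearest neighbours in the pool."""
    synthetic = []
    for _ in range(n_new):
        s = rng.choice(seeds)
        neighbours = sorted((q for q in pool if q != s),
                            key=lambda q: math.dist(s, q))[:k]
        n = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(s, n)))
    return synthetic


def undersample(majority, boundary_pts, n_keep):
    """Keep the n_keep majority points closest to any boundary point,
    which drops the points farthest from the boundary."""
    def gap(p):
        return min(math.dist(p, b) for b in boundary_pts)
    return sorted(majority, key=gap)[:n_keep]


def balance(minority, majority, k=3, seed=0):
    """Oversample minority boundary points and undersample the majority
    class until both reach the midpoint of the two class sizes (assumed target)."""
    rng = random.Random(seed)
    min_border = [p for p, f in zip(minority, boundary_flags(minority, k)) if f]
    maj_border = [p for p, f in zip(majority, boundary_flags(majority, k)) if f]
    target = (len(minority) + len(majority)) // 2
    # Fall back to the whole class if no boundary point was flagged.
    new_min = minority + smote(min_border or minority, minority,
                               target - len(minority), k, rng)
    new_maj = undersample(majority, maj_border or majority, target)
    return new_min, new_maj
```

Calling `balance(minority, majority)` with two lists of coordinate tuples returns the enlarged minority set and the reduced majority set. Each class needs more than `k` distinct points for the neighbour searches to work.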

Keywords: unbalanced data; clustering; boundary points; non-boundary points; sampling

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

 
