Classification algorithm of high-dimensional and imbalanced data based on support vector machine  Cited by: 3


Authors: Zhao Xiaoqiang [1,2,3], Zhang Lu [1]

Affiliations: [1] College of Electrical and Information Engineering, Lanzhou University of Technology, Lanzhou 730050, China; [2] Key Laboratory of Gansu Advanced Control for Industrial Processes, Lanzhou 730050, China; [3] National Demonstration Center for Experimental Electrical and Control Engineering Education, Lanzhou University of Technology, Lanzhou 730050, China

Source: Journal of Nanjing University (Natural Science), 2018, No. 2, pp. 452-461 (10 pages)

Funding: National Natural Science Foundation of China (61763029; 61370037); Gansu Province Basic Research Innovation Group (1506RJIA031)

Abstract: As data volumes keep growing, high-dimensional imbalanced datasets have become common. Traditional data-mining classification algorithms are sensitive to both the sample distribution and the dimensionality of such data, so their classification performance suffers. This paper proposes an improved support vector machine (SVM) classification algorithm for high-dimensional imbalanced datasets. First, the original imbalanced dataset is mapped into feature space by a kernel function, and an improved kernel SMOTE (Kernel Synthetic Minority Over-sampling Technique) algorithm generates positive-class samples: the homogeneous and heterogeneous K-nearest neighbors of each positive sample are sought in feature space, an adaptive neighbor threshold is set according to the internal distribution of the samples, and the resulting K-nearest-neighbor set is used to balance the sizes of the two classes. Next, sparse scores are computed for the features of the training samples in feature space and ranked by value, and feature selection based on sparse representation projects the high-dimensional dataset into a lower-dimensional space. Finally, the pre-images of the synthetic samples in input space are determined from the distance relation between feature space and input space, and the balanced dataset is classified with an SVM. Simulation experiments show that the proposed algorithm improves classification performance on high-dimensional imbalanced datasets.
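The oversampling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' improved kernel SMOTE: it assumes an RBF kernel, a fixed neighbor count `k` in place of the adaptive threshold, and plain SMOTE-style interpolation in input space in place of the pre-image step. All function names, the `gamma` value, and the toy data are hypothetical. The only part taken directly from the description is that neighbors are sought in feature space, which is done here through the kernel-trick identity d²(x, y) = k(x, x) - 2k(x, y) + k(y, y).

```python
import math
import random


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2); gamma is an assumed value."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def feature_space_sq_dist(x, y, kernel=rbf_kernel):
    """Squared distance between phi(x) and phi(y), computed via the kernel trick."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)


def k_nearest_in_feature_space(i, points, k, kernel=rbf_kernel):
    """Indices of the k points closest to points[i] in feature space."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: feature_space_sq_dist(points[i], points[j], kernel))
    return others[:k]


def oversample_minority(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by SMOTE-style interpolation.

    Simplification (assumption): neighbors are chosen in feature space, but the
    interpolation is done in input space, so no pre-image search is needed.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(k_nearest_in_feature_space(i, minority, k))
        t = rng.random()  # position along the segment between the two samples
        synthetic.append([a + t * (b - a) for a, b in zip(minority[i], minority[j])])
    return synthetic


# Toy data (hypothetical): 4 positive samples to be balanced against 10 negatives.
minority = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
majority_size = 10
new_points = oversample_minority(minority, majority_size - len(minority))
balanced_minority = minority + new_points
print(len(balanced_minority))
```

With an RBF kernel the feature-space distance grows monotonically with the Euclidean distance, so the neighbor ranking here coincides with the input-space ranking; a different kernel could change it. The balanced set would then be passed, together with the majority class, to any SVM implementation.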

Keywords: high-dimensional imbalanced dataset; classification algorithm; support vector machine (SVM); kernel SMOTE; sparse representation

Classification number: TP274 [Automation and Computer Technology: Detection Technology and Automatic Devices]

 
