An Improved KNN Text Categorization Algorithm Based on DBSCAN

Cited by: 5


Authors: Gou Heping (苟和平)[1], Jing Yongxia (景永霞)[1], Feng Baiming (冯百明)[2], Li Yong (李勇)[2]

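Affiliations: [1] Department of Information Technology, Qiongtai Teachers College, Haikou 571100; [2] College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070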

Source: Science Technology and Engineering (《科学技术与工程》), 2013, No. 1, pp. 219-222 (4 pages)

Funding: Key Project of Science and Technology Research of the Ministry of Education (208148); Qiongtai Teachers College project (qtkz201006)

Abstract: To classify a sample, the K-nearest neighbor (KNN) algorithm must compute the similarity between that sample and every sample in the training set. When the training set is large, this computation is expensive and classification efficiency drops. An improved algorithm based on DBSCAN clustering is therefore proposed. DBSCAN clustering is used to remove noisy data from the training samples. In addition, samples in the core sample set are pruned according to a sample similarity threshold and density, which reduces the number of training samples whose similarity to the sample being classified must be computed. Experiments show that the algorithm effectively reduces the computational cost of classification while keeping classification ability essentially unchanged.
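The two reduction steps named in the abstract (noise removal, then pruning of highly similar core samples) can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' method: the abstract does not give the exact pruning rule, so the greedy same-class similarity test below is an assumption, and density enters only through the DBSCAN core-point test. The function names (`cosine`, `dbscan_roles`, `prune_training_set`, `knn_predict`) and the parameters `eps_sim`, `min_pts`, and `prune_sim` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def dbscan_roles(X, eps_sim, min_pts):
    """Label each sample 'core', 'border', or 'noise' (DBSCAN point roles).

    Assumption: two samples are neighbors when cosine similarity >= eps_sim.
    """
    n = len(X)
    # Each neighborhood includes the point itself, as in standard DBSCAN.
    neigh = [[j for j in range(n) if cosine(X[i], X[j]) >= eps_sim]
             for i in range(n)]
    core = {i for i in range(n) if len(neigh[i]) >= min_pts}
    roles = []
    for i in range(n):
        if i in core:
            roles.append("core")
        elif any(j in core for j in neigh[i]):
            roles.append("border")
        else:
            roles.append("noise")
    return roles


def prune_training_set(X, y, eps_sim=0.98, min_pts=3, prune_sim=0.999):
    """Return the indices kept after noise removal and core-sample pruning."""
    roles = dbscan_roles(X, eps_sim, min_pts)
    kept = []
    for i, role in enumerate(roles):
        if role == "noise":
            continue  # step 1: drop samples DBSCAN marks as noise
        if role == "core" and any(
            y[k] == y[i] and cosine(X[i], X[k]) >= prune_sim for k in kept
        ):
            continue  # step 2 (assumed rule): drop near-duplicate core samples
        kept.append(i)
    return kept


def knn_predict(X, y, query, k=3):
    """Majority vote among the k training samples most similar to query."""
    ranked = sorted(((cosine(x, query), label) for x, label in zip(X, y)),
                    reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


# Toy usage: two tight groups of vectors plus one outlier.
X = [(1.0, 0.0), (1.0, 0.01), (1.0, 0.02), (1.0, 0.1), (1.0, 0.15),
     (0.0, 1.0), (0.01, 1.0), (0.02, 1.0), (0.1, 1.0), (1.0, 1.0)]
y = ["a", "a", "a", "a", "a", "b", "b", "b", "b", "a"]
kept = prune_training_set(X, y)
X_small = [X[i] for i in kept]
y_small = [y[i] for i in kept]
print(len(X), "->", len(X_small), knn_predict(X_small, y_small, (1.0, 0.05)))
```

The pairwise neighborhood computation is quadratic in the number of training samples, but it runs once, offline; the saving claimed in the abstract applies at classification time, when each query is compared only against the reduced set.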

Keywords: K-nearest neighbor; text categorization; sample pruning

Classification code: TP391.11 [Automation and Computer Technology: Computer Application Technology]

 
