A text classification algorithm based on feature library projection

Cited by: 1


Authors: YIN Shaofeng, ZHENG Hui[2], XU Shaohua, RONG Huigui[3], ZHANG Na[3]

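Affiliations: [1] Office of Campus Informatization Construction and Management, Hunan University, Changsha 410082, Hunan, China; [2] School of Tourism Management, Hunan University of Commerce, Changsha 410205, Hunan, China; [3] College of Information Engineering and Science, Hunan University, Changsha 410082, Hunan, China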

Source: Journal of Central South University: Science and Technology, 2017, No. 7, pp. 1782-1789 (8 pages)

Funding: National Natural Science Foundation of China (61672221; 61304184; 61672156)

Abstract: Mainstream KNN-based text classification strategies suit automatic classification with large sample sizes, but they have high time complexity, and their feature dimensionality reduction and sample pruning steps tend to lose information. This paper proposes a classification algorithm based on feature library projection (FLP). The algorithm first builds a feature library from the features of all training samples under a given weighting strategy, so that the library retains the feature information of every sample. Then, through a projection function, it maps the feature library of each class to a projection sample according to the feature set of the sample to be classified, and completes the classification by computing the similarity between the new sample and the projection sample of each class. The algorithm was validated on the corpus compiled by the natural language processing group of the International Database Center, Fudan University. Tests were run in two scenarios, one with a small number of training texts and one with a large number, and the results were compared with a clustering-based KNN algorithm. The experimental results show that the FLP algorithm does not lose classification features and achieves relatively high classification accuracy; its classification efficiency is not directly tied to growth in the sample size, and its time complexity is low.
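The two steps described in the abstract (building per-class feature libraries, then projecting each library onto the features of the sample to be classified) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not specify the weighting strategy, the projection function, or the similarity measure, so summed term frequency, restriction to the sample's features, and cosine similarity are used here as stand-ins. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_feature_libraries(training_samples):
    """Accumulate one weighted feature library per class.

    Assumption: a feature's weight is its summed term frequency over the
    class's training texts. The paper's actual weighting strategy is not
    given in the abstract.
    """
    libraries = defaultdict(Counter)
    for tokens, label in training_samples:
        libraries[label].update(tokens)
    return dict(libraries)


def project(library, sample_features):
    """Map a class library to a projection sample.

    Assumption: the projection keeps only the library entries whose
    features also occur in the sample to be classified.
    """
    return {f: library[f] for f in sample_features if f in library}


def cosine(a, b):
    """Cosine similarity between two sparse feature-weight mappings."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(libraries, tokens):
    """Return the best-scoring class and the per-class similarity scores."""
    sample = Counter(tokens)
    scores = {
        label: cosine(sample, project(library, sample))
        for label, library in libraries.items()
    }
    return max(scores, key=scores.get), scores


# Toy data, invented purely for illustration.
training = [
    (["goal", "match", "team"], "sports"),
    (["team", "coach", "goal"], "sports"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "loan", "market"], "finance"),
]
libraries = build_feature_libraries(training)
label, scores = classify(libraries, ["team", "goal", "score"])
print(label, scores)
```

In this sketch, classifying a new text touches one library per class and only the features present in that text, so the per-query cost grows with the number of classes and the length of the text rather than with the number of training texts. That is consistent with the abstract's claim about efficiency, although the paper's actual complexity analysis may differ.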

Keywords: text classification; KNN algorithm; feature library projection

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
