KNN Text Classification Algorithm with Probabilistic Latent Semantic Analysis

Cited by: 3


Authors: QI Houlin (戚后林), GU Lei (顾磊) [1]

Affiliation: [1] School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003

Source: Computer Technology and Development (《计算机技术与发展》), 2017, No. 7, pp. 57-61 (5 pages)

Funding: National Natural Science Foundation of China (61302157)

Abstract: When computing the similarity between texts, the traditional KNN text classification algorithm performs only simple concept matching and ignores the semantic information carried by the terms in the training and test texts. As a result, classifying text with a KNN classifier can lose semantic information and produce inaccurate results. To address this, a KNN text classification algorithm based on a probabilistic latent topic model is proposed. The algorithm first uses the probabilistic topic model to build text-topic and topic-term models of the training texts, mapping the semantic information carried by each text into a low-dimensional topic space. Text similarity is then expressed through the text-topic and topic-term probability distributions, and KNN is applied to this low-dimensional semantic representation to classify the texts. Experimental results show that, on larger training sets and larger sets of texts to be classified, the proposed algorithm can perform semantic classification of text with a KNN classifier and improves the accuracy, recall, and F1 measure of KNN classification.
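The abstract describes the pipeline only in outline, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It fits a probabilistic latent topic model (PLSA) by EM on the training texts, estimates topic distributions for unseen texts while holding the topic-term distributions fixed ("folding in"), and then runs KNN over the resulting text-topic distributions. The use of cosine similarity, the fold-in step, the toy counts, and all function names (`plsa_em`, `knn_predict`, `demo`) are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def normalize(values):
    """Scale non-negative values so they sum to 1 (uniform if all are zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa_em(counts, n_topics, n_iter=50, seed=0, p_w_z=None):
    """Fit PLSA by EM on a document-term count matrix (list of lists).

    If p_w_z is given, the topic-term distributions stay fixed and only the
    text-topic distributions are estimated (folding in unseen texts).
    Returns (p_z_d, p_w_z).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    fixed = p_w_z is not None
    if not fixed:
        p_w_z = [normalize([rng.random() for _ in range(n_words)])
                 for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() for _ in range(n_topics)])
             for _ in range(n_docs)]
    for _ in range(n_iter):
        new_zd = [[0.0] * n_topics for _ in range(n_docs)]
        new_wz = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
                post = normalize([p_z_d[d][z] * p_w_z[z][w]
                                  for z in range(n_topics)])
                for z in range(n_topics):
                    mass = counts[d][w] * post[z]
                    new_zd[d][z] += mass
                    new_wz[z][w] += mass
        # M-step: renormalize the expected counts into distributions.
        p_z_d = [normalize(row) for row in new_zd]
        if not fixed:
            p_w_z = [normalize(row) for row in new_wz]
    return p_z_d, p_w_z


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def knn_predict(train_vecs, train_labels, query, k=3):
    """Majority vote among the k training vectors most similar to query."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(train_vecs[i], query), reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def demo():
    # Hypothetical toy vocabulary: goal, match, team, stock, market, bank.
    train_counts = [
        [3, 2, 1, 0, 0, 0], [1, 3, 2, 0, 0, 0], [2, 0, 3, 0, 0, 0],
        [0, 0, 0, 3, 2, 1], [0, 0, 0, 1, 3, 2], [0, 0, 0, 2, 0, 3],
    ]
    train_labels = ["sports"] * 3 + ["finance"] * 3
    test_counts = [[0, 2, 2, 0, 0, 0], [0, 0, 0, 1, 1, 2]]
    # Model the training texts, then fold in the texts to be classified.
    train_topics, p_w_z = plsa_em(train_counts, n_topics=2)
    test_topics, _ = plsa_em(test_counts, n_topics=2, p_w_z=p_w_z)
    return [knn_predict(train_topics, train_labels, q, k=3)
            for q in test_topics]


if __name__ == "__main__":
    print(demo())
```

In this sketch the KNN step never sees raw term counts: neighbors are compared only in the low-dimensional topic space, which is the property the abstract relies on. EM converges to a local optimum that depends on the random initialization, so results on real data would need several restarts and a tuned number of topics.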

Keywords: text classification; KNN algorithm; text representation model; semantic classification; probabilistic latent topic model

Classification code: TP301.6 [Automation and Computer Technology: Computer System Architecture]

 
