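Affiliation: [1] School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210003, China
Source: Computer Technology and Development (《计算机技术与发展》), 2017, No. 7, pp. 57-61 (5 pages)
Funding: National Natural Science Foundation of China (61302157)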
Abstract: When computing the similarity between texts, the traditional KNN text classification algorithm performs only simple concept matching and ignores the semantic information carried by the terms in the training and test texts. As a result, classifying text with a KNN classifier may lose semantic meaning and produce inaccurate results. To address this problem, a KNN text classification algorithm based on a probabilistic latent topic model is proposed. The algorithm first uses a probabilistic topic model to build text-topic and topic-term models of the training texts, mapping the semantic information carried by each text into a low-dimensional topic space. Text similarity is then expressed through the text-topic and topic-term probability distributions, and the KNN algorithm classifies texts using their low-dimensional semantic representations. Experimental results show that, on larger training sets and sets of texts to be classified, the proposed algorithm can perform semantic text classification with a KNN classifier and improves the accuracy, recall, and F1 measure of KNN classification.
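The classification step described in the abstract can be illustrated with a minimal sketch. It assumes that each text has already been mapped to a text-topic probability distribution by some topic model; the distributions, the labels, and the function names `js_divergence` and `knn_classify` below are hypothetical. The abstract does not state which similarity measure the paper uses, so the Jensen-Shannon divergence here is a stand-in chosen for illustration, not the authors' method.

```python
import math
from collections import Counter


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    Assumption: the paper's actual similarity measure is not given in the
    abstract; this divergence is only an illustrative choice.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def knn_classify(query, train, k=3):
    """Label a query topic distribution by majority vote of its k nearest
    training texts, where `train` is a list of (distribution, label) pairs."""
    neighbors = sorted(train, key=lambda item: js_divergence(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical text-topic distributions over three topics.
train = [
    ([0.8, 0.1, 0.1], "sports"),
    ([0.7, 0.2, 0.1], "sports"),
    ([0.1, 0.8, 0.1], "finance"),
    ([0.2, 0.7, 0.1], "finance"),
    ([0.1, 0.1, 0.8], "tech"),
]

print(knn_classify([0.75, 0.15, 0.10], train, k=3))
```

Because neighbors are compared in the low-dimensional topic space rather than by raw term overlap, two texts with few shared words can still be close if they draw on the same topics, which is the motivation the abstract gives for the approach.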
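Keywords: text classification; KNN algorithm; text representation model; semantic classification; probabilistic latent topic model
Classification code: TP301.6 [Automation and Computer Technology: Computer Systems Architecture]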