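Affiliations: [1] School of Computer Science, Central China Normal University, Wuhan 430079 [2] School of Computer Science and Technology, Hankou University, Wuhan 430212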
Source: Application Research of Computers (《计算机应用研究》), 2016, No. 2, pp. 375-377, 383 (4 pages)
Abstract: To reduce the time cost of Chinese text similarity computation and improve text clustering accuracy, this paper proposes PST_LDA (part-of-speech tagging latent Dirichlet allocation), a method for computing Chinese text similarity. First, the words in each text are tagged by part of speech and divided into a noun set, a verb set, and a set of all other words. Second, a separate LDA topic model is built for each of the three sets. Finally, the three topic models are combined according to a given weight ratio, and the distance between two texts is computed with the JS (Jensen-Shannon) distance. Because the method accounts for the different contributions that words of different parts of speech make to text similarity, it uses the semantic information in the text to improve clustering accuracy. The LDA modeling of the three separated word sets is run in parallel as three independent models, which reduces modeling time and speeds up text clustering. Simulation experiments on the TanCorp-12 data set compared the LDA and PST_LDA methods. The results show that PST_LDA not only reduces modeling time but also achieves somewhat higher clustering accuracy.
Keywords: part-of-speech tagging; LDA model; PST_LDA model; text similarity calculation
Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
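
The final step described in the abstract, combining three per-part-of-speech topic models by weight and comparing texts with the JS distance, might be sketched as below. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the weights or the exact combination rule, so the weight values, the weighted-sum rule, and the names `js_divergence`, `pst_distance`, and `POS_WEIGHTS` are all hypothetical. Each document is assumed to already carry one topic distribution per word set, inferred by a separately trained LDA model.

```python
import math

# Hypothetical weights for the noun, verb, and other-word topic models.
# The abstract only says "a given weight ratio"; these values are assumptions.
POS_WEIGHTS = {"noun": 0.5, "verb": 0.3, "other": 0.2}


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 are skipped."""
    return sum(pi * math.log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions (0 to 1 with log base 2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def pst_distance(doc_a, doc_b, weights=POS_WEIGHTS):
    """Weighted sum of per-word-set JS divergences (an assumed combination rule).

    doc_a and doc_b map each word-set name to that document's topic
    distribution under the LDA model trained on that word set.
    """
    return sum(w * js_divergence(doc_a[pos], doc_b[pos]) for pos, w in weights.items())


# Made-up topic distributions for two documents, for illustration only.
doc_a = {"noun": [0.7, 0.2, 0.1], "verb": [0.5, 0.5], "other": [0.9, 0.1]}
doc_b = {"noun": [0.1, 0.2, 0.7], "verb": [0.5, 0.5], "other": [0.8, 0.2]}

print(pst_distance(doc_a, doc_b))
```

Because the three LDA models are trained on disjoint word sets, they do not depend on one another, which is what makes the parallel modeling mentioned in the abstract possible; the sketch above covers only the distance computation that follows training.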