Word Similarity Calculation for Baidu Baike Terms Based on an Improved TF-IDF

Cited by: 6

Authors: Yang Xin (杨欣); Guo Jianbin (郭建彬) (College of Management and Economics, Tianjin University, Tianjin 300072, China)

Source: Journal of Gansu Sciences (《甘肃科学学报》), 2019, No. 2, pp. 143-147 (5 pages)

Abstract: This paper studies word similarity calculation based on Baidu Baike. Combining the TF-IDF algorithm with the content of Baidu Baike entries, it proposes a word similarity calculation method based on an improved TF-IDF. When TF-IDF is used to weight the terms in a text, some highly representative terms receive low weights. To address this, the term weight calculation is improved by introducing category information about how terms are distributed across Baike entries, including the effect of a term's within-class and between-class distribution on its weight. In addition, the representativeness of a term is defined according to how frequently it occurs in the whole collection. The weight factors of the terms in a Baike entry are computed to build a relevance vector for the entry, and word similarity is computed from the cosine of the angle between vectors. Experiments show that, compared with computing weights without TF-IDF and with classic TF-IDF, both the TF-IDF method that incorporates category information and the TF-IDF method that defines representativeness improve the accuracy of word similarity calculation.

Keywords: TF-IDF; Baidu Baike; word similarity; term representativeness

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
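
The abstract names classic TF-IDF weighting as one of its baselines and cosine similarity between entry vectors as the similarity measure. The following is a minimal sketch of that baseline only, not of the authors' improved method: the abstract does not give the formulas for the category-information or representativeness terms, so they are left out. The function names, the toy `entries` data, and the assumption that each entry is already segmented into tokens are all hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Classic TF-IDF weights for a dict of {entry name: list of tokens}."""
    n_docs = len(docs)
    # Document frequency: number of entries in which each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        # Weight = relative term frequency * log(N / document frequency).
        vectors[name] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical, pre-tokenized entry texts (real Baike text would first
# need Chinese word segmentation, which is outside this sketch).
entries = {
    "apple": ["fruit", "tree", "sweet", "red"],
    "pear": ["fruit", "tree", "sweet", "green"],
    "car": ["engine", "wheel", "road", "red"],
}
vectors = tf_idf_vectors(entries)
sim_apple_pear = cosine_similarity(vectors["apple"], vectors["pear"])
sim_apple_car = cosine_similarity(vectors["apple"], vectors["car"])
```

In this toy data, "apple" and "pear" share three terms while "apple" and "car" share one, so the first pair scores higher. A term that appears in every entry gets an IDF of zero under this formula, which is one simple way a common but still informative term can end up with a low weight, the kind of problem the abstract says the improved weighting targets.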
