Authors: Yang Xin (杨欣); Guo Jianbin (郭建彬) (College of Management and Economics, Tianjin University, Tianjin 300072, China)
Source: Journal of Gansu Sciences (《甘肃科学学报》), 2019, Issue 2, pp. 143-147 (5 pages)
Abstract: This paper studies word similarity calculation based on Baidu Baike. Combining the TF-IDF algorithm with the content of Baidu Baike entries, it proposes a Baidu Baike word similarity calculation method based on improved TF-IDF. When TF-IDF computes the weights of words in a text, some highly representative words receive low weights. To address this, the method introduces category information about how words are distributed in encyclopedia entries, including the effect of intra-class and inter-class distribution on word weight, to improve the weight calculation. In addition, it defines the representativeness of a word according to the frequency with which the word occurs in the whole collection. By computing the weight factors of the words in an entry, it constructs a relevance vector for each entry and computes word similarity from the cosine of the angle between vectors. Experiments show that, compared with computing weights without TF-IDF and with classical TF-IDF, the TF-IDF method that incorporates category information and the TF-IDF method that defines representativeness improve the accuracy of word similarity calculation.
Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]
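
The abstract describes a pipeline in which each entry is turned into a weighted term vector and two entries are compared by the cosine of their vectors. The sketch below illustrates only the classical TF-IDF baseline that the paper compares against. The abstract does not give the formulas for the category-based or representativeness-based weighting, so they are not reproduced here. The function names (`tf_idf_vectors`, `cosine`) and the toy entries are hypothetical, chosen purely for illustration, and the token lists stand in for segmented entry text.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build classical TF-IDF vectors.

    docs maps an entry name to its list of tokens.
    Returns a dict mapping each entry name to {term: weight}.
    """
    n_docs = len(docs)
    # Document frequency: number of entries in which each term occurs.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    vectors = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        vectors[name] = {
            # tf = relative frequency in the entry, idf = log(N / df).
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy "entries"; real input would be segmented Baidu Baike text.
entries = {
    "apple": ["fruit", "tree", "sweet", "red"],
    "pear": ["fruit", "tree", "sweet", "green"],
    "car": ["engine", "wheel", "road", "red"],
}

vectors = tf_idf_vectors(entries)
sim_apple_pear = cosine(vectors["apple"], vectors["pear"])
sim_apple_car = cosine(vectors["apple"], vectors["car"])
print(sim_apple_pear > sim_apple_car)
```

In this toy collection "apple" and "pear" share three terms while "apple" and "car" share one, so the first pair scores higher. An improved weighting of the kind the abstract describes would replace only the weight expression inside `tf_idf_vectors`, leaving the cosine step unchanged.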