Word Similarity Measurement Based on BaiduBaike  (Cited by: 22)


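Authors: ZHAN Zhijian (詹志建)[1], LIANG Lina (梁丽娜)[1], YANG Xiaoping (杨小平)[1]

Affiliation: [1] School of Information, Renmin University of China, Beijing 100872

Source: Computer Science (《计算机科学》), 2013, No. 6, pp. 199-202 (4 pages)

Funding: Supported by the National Natural Science Foundation of China (70871115)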

Abstract: Word similarity computation is one of the key technologies of natural language processing and a widely studied fundamental topic. Traditional word similarity measures are mostly based either on semantic knowledge or on corpus statistics; that is, they require a hierarchically organized semantic dictionary or a large-scale corpus. This paper proposes a new word similarity measure based on BaiduBaike. By analyzing the information in BaiduBaike entries, it assesses entry similarity comprehensively from the explanatory content that characterizes each entry, defines formulas for the similarity between entries, and obtains the overall similarity from the similarities between corresponding parts. Experimental results show that, compared with existing similarity measures, the proposed algorithm is more effective and reasonable.

Research on word similarity measurement has been popular not only in natural language processing but also in other basic research. Traditional word similarity measurements use semantic lexicons or large-scale corpora. We first discussed the background of the applications of word similarity measurement, such as information retrieval, information extraction, text classification, and example-based machine translation. Then two strategies of word similarity measurement were summarized: one is based on an ontology or a semantic taxonomy, the other is based on large collocations of words in a corpus. BaiduBaike, an online open encyclopedia, can be used not only as a corpus but also as a knowledge resource with rich semantic information. Based on BaiduBaike with its rich semantic information and category graph, we proposed a new method to analyze and compute Chinese word similarity from four dimensions: the baike card, the content of the word, the open classification of the word, and the correlation words. We used a language network to choose the top key terms of the content of a word. Based on vector space model (VSM) theory, we calculated the similarity between parts of words. We presented a new "multi-path searching" algorithm on the BaiduBaike category graph. A comprehensive similarity measuring method based on the four parts was proposed. Experimental results show that the method performs well.

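Keywords: word similarity; language network; BaiduBaike; vector space model

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]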
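
The abstract describes computing part-level similarities with the vector space model and combining four parts (baike card, content, open classification, correlation words) into one score. The record gives neither the formulas nor the weights, so the following is only a minimal sketch under stated assumptions: each part is treated as a bag of terms, every part is compared with plain cosine similarity (the paper instead uses a "multi-path searching" algorithm on the category graph for the open classification), and the equal weights in `PART_WEIGHTS` are hypothetical. The names `cosine_similarity`, `entry_similarity`, and `PART_WEIGHTS` are illustrative and do not come from the paper.

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Cosine similarity between two bags of terms (term-frequency vectors)."""
    vec_a, vec_b = Counter(terms_a), Counter(terms_b)
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty part is treated as sharing nothing
    return dot / (norm_a * norm_b)


# Hypothetical equal weights; the record does not state the paper's values.
PART_WEIGHTS = {
    "card": 0.25,            # baike card
    "content": 0.25,         # key terms of the entry content
    "classification": 0.25,  # open classification labels
    "related": 0.25,         # correlation words
}


def entry_similarity(entry_a, entry_b, weights=PART_WEIGHTS):
    """Weighted sum of per-part cosine similarities between two entries.

    Each entry is a dict mapping a part name to a list of terms.
    Missing parts are treated as empty.
    """
    return sum(
        w * cosine_similarity(entry_a.get(part, []), entry_b.get(part, []))
        for part, w in weights.items()
    )


if __name__ == "__main__":
    # Toy entries with made-up terms, for illustration only.
    apple = {"card": ["fruit", "red"], "content": ["fruit", "tree", "sweet"],
             "classification": ["fruit", "plant"], "related": ["pear"]}
    pear = {"card": ["fruit", "green"], "content": ["fruit", "tree", "juicy"],
            "classification": ["fruit", "plant"], "related": ["apple"]}
    print(round(entry_similarity(apple, pear), 3))
```

With weights that sum to 1, the combined score stays in [0, 1], which makes it directly comparable across word pairs; any real use would need the weights tuned against human similarity judgments.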

 
