基于搜索引擎的词汇语义相似度计算方法  被引量:21

Measuring Semantic Similarity between Words Using Web Search Engines

在线阅读下载全文

作  者:陈海燕[1] 

机构地区:[1]华东政法大学计算机科学与技术系,上海201620

出  处:《计算机科学》2015年第1期261-267,共7页Computer Science

基  金:国家社会科学基金项目(06BFX051);上海高校选拔培养优秀青年教师科研专项基金(hzf05046)资助

摘  要:词汇语义相似度的计算在网页浏览和查询推荐等网络相关工作中起着重要的作用。传统的基于分类的方法不能处理持续出现的新词。由于网络数据中隐藏着大量的噪音和冗余,鲁棒性和准确性仍然是一个挑战,因此提出了一种基于搜索引擎的词汇语义相似度计算方法。语义片段和检索结果的页数被用来去除词汇语义相似度计算过程中的噪音和冗余。此外,还提出了一种方法来整合查询结果页数、语义片段和显示的搜索结果的数量,该方法不需要任何先验知识与本体。实验结果显示,所提出的方法在Rubenstein-Goodenough测试集的相关系数为0.851,优于现有的基于网络的词汇语义相似度计算方法,同时在搜索引擎的查询扩展任务中具有较为良好的应用效果。Semantic similarity measures play important roles in many Web-related tasks such as Web browsing and query suggestion.Because taxonomy-based methods cannot deal with continually emerging words,recently Web-based methods have been proposed to solve this problem.Because of the noise and redundancy hidden in the Web data,robustness and accuracy are still challenges.We proposed a method integrating page counts and snippets returned by Web search engines.Then,the semantic snippets and the number of search results were used to remove noise and redundancy in the Web snippets.After that,a method integrating page counts,semantics snippets and the number of already displayed search results was proposed.The proposed method does not need any human annotated knowledge,and can be applied Web-related tasks easily.A correlation coefficient of 0.851 against Rubenstein-Goodenough benchmark dataset shows that the proposed method outperforms the existing Web-based methods by a wide margin.Moreover,the proposed semantic similarity measure significantly improves the quality of query suggestion against some page counts based methods.

关 键 词:语义相似度 信息检索 查询建议 网络检索 

分 类 号:TP391[自动化与计算机技术—计算机应用技术]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象