Authors: YANG Nan [1], LI Yaping [1]
Affiliation: [1] School of Information, Renmin University of China, Beijing 100872, China
Source: Journal of Computer Applications (《计算机应用》), 2019, No. 6, pp. 1701-1706 (6 pages)
Funding: National Natural Science Foundation of China (61773385)
Abstract: For generalized or ambiguous user queries, clustering the result list returned by a Web search engine helps users find the content they are interested in. The returned list consists of short texts called snippets, and the traditional Term Frequency-Inverse Document Frequency (TF-IDF) feature selection model is not suited to such sparse short texts, so clustering performance degrades. An effective remedy is to extend the short texts using an external knowledge base. Inspired by neural-network-based word representation methods, a snippet extension approach based on the Word2Vec word embedding model was proposed: the TopN most similar words given by the Word2Vec model are used to extend each snippet, and the extended documents improve the clustering performance obtained with TF-IDF feature selection. In addition, to reduce the noise introduced by commonly used terms, the term frequency weights in the TF-IDF matrix of the extended documents were modified. Experiments were conducted on two public datasets, ODP239 and SearchSnippets, comparing the proposed method with pure snippets without extension, WordNet-based feature extension, and Wikipedia-based feature extension. The experimental results show that the proposed method outperforms the compared methods significantly in terms of clustering performance.
Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
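The extension step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `TOY_NEIGHBOURS` table is a hypothetical stand-in for the TopN similar-word lookup of a trained Word2Vec model, the helper names `expand_snippet` and `tf_idf` are invented for illustration, and the TF-IDF formula is the plain textbook variant. The term frequency weight modification mentioned in the abstract is omitted because the abstract does not give its formula.

```python
import math
from collections import Counter

# Hypothetical stand-in for a trained Word2Vec model's nearest-neighbour lookup.
# In practice these lists would come from a model trained on a large corpus.
TOY_NEIGHBOURS = {
    "jaguar": ["cat", "leopard", "car"],
    "apple": ["fruit", "iphone", "mac"],
    "python": ["snake", "programming", "java"],
}


def expand_snippet(tokens, neighbours, top_n):
    """Append up to top_n similar words for each token of the snippet."""
    expanded = list(tokens)
    for tok in tokens:
        expanded.extend(neighbours.get(tok, [])[:top_n])
    return expanded


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


# Three toy snippets, already tokenized.
snippets = [["jaguar", "speed"], ["apple", "store"], ["python", "tutorial"]]
expanded = [expand_snippet(s, TOY_NEIGHBOURS, top_n=2) for s in snippets]
vectors = tf_idf(expanded)
```

Each extended snippet carries more terms than the original, so its TF-IDF vector is less sparse; the resulting vectors could then be passed to any standard clustering algorithm.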