Author: Zhang Yan (张艳) [1]
Source: Software Guide (《软件导刊》), 2012, No. 4, pp. 138-141 (4 pages)
Abstract: Addressing the characteristics of specialized (domain-specific) search engines, this paper improves a web page de-duplication algorithm based on term-frequency statistics. The resulting algorithm works in two steps. First, it computes the vocabulary overlap between two documents to judge whether the sets of domain keywords they use are roughly the same. Second, if the first check passes, it further judges whether the two documents use each domain keyword with the same frequency. De-duplication algorithms remove duplicated web pages for search engines, which both improves the efficiency of information retrieval and noticeably reduces data storage. This paper studies the problem of removing duplicated web pages for search engines and compares existing duplicate detection algorithms. Building on these, it improves an existing algorithm to obtain a new algorithm based on word frequency. The method is also compared with two of the most popular copy detection mechanisms, and it has been successfully applied to remove near-duplicate web pages.
Keywords: web page de-duplication; specialized search engine; keyword feature vector; term-frequency statistics
Classification: TP393 [Automation and Computer Technology: Computer Application Technology]
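
The two-step check described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not give the formulas, so Jaccard overlap for step one, cosine similarity for step two, the thresholds `overlap_min` and `freq_min`, and all function names are assumptions. Tokenization here is a simple regex split, so Chinese text would first need a word segmenter.

```python
import math
import re
from collections import Counter


def keyword_counts(text, domain_keywords):
    """Count how often each domain keyword occurs among the text's tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t in domain_keywords)


def keyword_overlap(a, b):
    """Step 1 (assumed metric): Jaccard overlap of the two keyword sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def frequency_similarity(a, b):
    """Step 2 (assumed metric): cosine similarity of keyword frequency vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(text_a, text_b, domain_keywords,
                      overlap_min=0.8, freq_min=0.95):
    """Apply the two checks in order; thresholds are illustrative guesses."""
    a = keyword_counts(text_a, domain_keywords)
    b = keyword_counts(text_b, domain_keywords)
    # Step 1: the documents must use roughly the same set of domain keywords.
    if keyword_overlap(a, b) < overlap_min:
        return False
    # Step 2: the keywords must also be used with similar frequencies.
    return frequency_similarity(a, b) >= freq_min


domain = {"crawler", "index", "ranking", "query"}
doc1 = "The crawler feeds the index. The index supports query ranking."
doc2 = "The crawler feeds the index; the index supports ranking of each query."
doc3 = "The query log is large."
doc4 = "crawler crawler crawler crawler crawler crawler index query ranking"

print(is_near_duplicate(doc1, doc2, domain))
print(is_near_duplicate(doc1, doc3, domain))
print(is_near_duplicate(doc1, doc4, domain))
```

Running the cheap set comparison first means the frequency comparison is only computed for pairs that already share most of their domain vocabulary; `doc4` shares every keyword with `doc1` but is rejected in step two because its frequencies differ.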