Research on Web Page De-Duplication Based on a Topic-Specific Search Engine


Author: Zhang Yan (张艳)[1]

Affiliation: [1] Nanjing Army Command College (南京陆军指挥学院), Nanjing, Jiangsu 210045

Source: Software Guide (《软件导刊》), 2012, No. 4, pp. 138-141 (4 pages)

Abstract: De-duplication algorithms remove duplicated web pages from a search engine's collection, which both improves the efficiency of information retrieval and noticeably reduces the storage needed for the data. This paper studies the problem of removing duplicated web pages for search engines and compares existing duplicate-detection algorithms. Building on that comparison and on the characteristics of topic-specific search engines, it improves the web page de-duplication algorithm based on word-frequency statistics. The improved algorithm proceeds in two steps. First, it computes the overlap in the words used by two documents to judge whether the sets of domain keywords they use are roughly the same. Second, for document pairs that pass the first check, it further judges whether the two documents use each domain keyword with the same frequency. The method is also compared with two of the most popular copy-detection mechanisms, and it has been successfully applied to removing near-replica web pages.
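
The two-step check described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas or thresholds, so everything concrete here is an assumption: the overlap measure (Jaccard overlap of keyword sets), the frequency comparison (cosine similarity of relative keyword frequencies), the threshold values, and all function names are hypothetical. Documents are assumed to be already split into word tokens, which for Chinese text would require a word segmenter.

```python
import math
from collections import Counter


def keyword_counts(tokens, domain_keywords):
    """Count only the tokens that appear in the domain keyword list."""
    return Counter(t for t in tokens if t in domain_keywords)


def keyword_overlap(counts_a, counts_b):
    """Step 1 measure (assumed): Jaccard overlap of the two keyword sets."""
    set_a, set_b = set(counts_a), set(counts_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def frequency_similarity(counts_a, counts_b):
    """Step 2 measure (assumed): cosine similarity of relative frequencies."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    keys = set(counts_a) | set(counts_b)
    vec_a = [counts_a[k] / total_a for k in keys]  # Counter gives 0 if missing
    vec_b = [counts_b[k] / total_b for k in keys]
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    return dot / (norm_a * norm_b)


def is_near_duplicate(tokens_a, tokens_b, domain_keywords,
                      overlap_threshold=0.8, freq_threshold=0.95):
    """Two-step test; both thresholds are illustrative, not from the paper."""
    counts_a = keyword_counts(tokens_a, domain_keywords)
    counts_b = keyword_counts(tokens_b, domain_keywords)
    # Step 1: do the documents use roughly the same set of domain keywords?
    if keyword_overlap(counts_a, counts_b) < overlap_threshold:
        return False
    # Step 2: do they use those keywords with roughly the same frequency?
    return frequency_similarity(counts_a, counts_b) >= freq_threshold


if __name__ == "__main__":
    domain = {"crawler", "index", "ranking", "query"}
    doc_a = "the crawler builds the index and the index serves the query".split()
    doc_b = "a crawler builds an index then that index serves each query".split()
    print(is_near_duplicate(doc_a, doc_b, domain))
```

Running the cheap set comparison first means the frequency comparison is computed only for pairs that already share most of their domain keywords, which matches the ordering of the two steps in the abstract.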

Keywords: web page de-duplication; topic-specific search engine; keyword feature vector; word-frequency statistics

Classification number: TP393 [Automation and Computer Technology: Computer Application Technology]

 
