Web Page Deduplication Strategy Based on Edit Distance

Near-replicas of Web Pages Detection Based on Levenshtein Distance

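Affiliations: [1] Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190; [2] University of Chinese Academy of Sciences, Beijing 100049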

Source: Network New Media Technology, 2013, No. 6, pp. 1-7 (7 pages)

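Funding: National High Technology Research and Development Program of China (863 Program), 12th Five-Year Plan project (2012AA011102); State Language Commission 12th Five-Year Plan research project (YB125-53); Chinese Academy of Sciences Academic Divisions consulting project (Y129091211)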

Abstract: The Internet contains a large number of duplicated web pages, and web page deduplication is one of the keys to improving efficiency in information retrieval and large-scale web page collection. Building on a study of existing deduplication algorithms based on "fingerprints" or feature codes, this paper proposes a web page deduplication algorithm based on edit distance (Levenshtein distance): the similarity between two web pages is obtained by computing the edit distance between their fingerprint sequences. The algorithm overcomes a shortcoming of fingerprint and feature-code algorithms, which do not take the structure of a page's main text into account. It compares pages on both content and text structure, which makes duplicate detection more accurate. Experiments show that the algorithm is effective, with relatively high precision and recall.

Keywords: Internet; web page deduplication; fingerprint; edit distance

Classification: TP393.092 [Automation and Computer Technology: Computer Application Technology]
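
The idea described in the abstract, comparing two pages by the edit distance between their fingerprint sequences, can be illustrated with a minimal sketch. The abstract does not say how the main text is segmented, which hash produces each fingerprint, or how the distance is turned into a similarity score, so the choices below are assumptions made only for illustration: sentence-level segmentation, a truncated MD5 digest, and normalization by the longer sequence length. The function names `fingerprint_sequence`, `levenshtein`, and `similarity` are hypothetical and not taken from the paper.

```python
import hashlib
import re


def fingerprint_sequence(text):
    """Split text into sentence-like segments and hash each one.

    Assumption: sentence-level segmentation and truncated MD5 digests are
    illustrative choices, not the paper's actual fingerprinting scheme.
    """
    segments = [s.strip() for s in re.split(r"[.!?。！？\n]+", text)]
    return [hashlib.md5(s.encode("utf-8")).hexdigest()[:8]
            for s in segments if s]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(text_a, text_b):
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    fa = fingerprint_sequence(text_a)
    fb = fingerprint_sequence(text_b)
    longest = max(len(fa), len(fb))
    if longest == 0:
        return 1.0  # two empty pages are treated as identical
    return 1.0 - levenshtein(fa, fb) / longest
```

Because the distance is computed over an ordered sequence, two pages with the same segments in a different order score lower than two pages that differ in a single segment. A set-based comparison of fingerprints would not make that distinction. Any threshold for declaring two pages duplicates would have to be chosen empirically; the abstract gives none.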
