A Web Page Deduplication Strategy Based on Edit Distance

Near-replicas of Web Pages Detection Based on Levenshtein Distance


Authors: Ding Zeya (丁泽亚)[1,2], Zhang Quan (张全)[1]

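Affiliations: [1] Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190; [2] University of Chinese Academy of Sciences, Beijing 100049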

Source: Network New Media Technology (《网络新媒体技术》), 2013, No. 6, pp. 1-7 (7 pages)

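Funding: National High Technology Research and Development Program of China (863 Program), 12th Five-Year Plan project (2012AA011102); State Language Commission 12th Five-Year Plan research project (YB125-53); Chinese Academy of Sciences Academic Divisions consulting project (Y129091211)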

Abstract: Duplicate web pages are abundant on the Internet, and removing them is one of the keys to efficient information retrieval and large-scale web page collection. Building on a study of existing deduplication algorithms based on "fingerprints" or feature codes, this paper proposes a web page deduplication algorithm based on edit distance (Levenshtein distance): the similarity between two web pages is obtained by computing the edit distance between their fingerprint sequences. The algorithm overcomes a shortcoming of fingerprint and feature-code algorithms, which do not take the structure of a page's main text into account. Because it compares pages on both content and text structure, it judges duplication more accurately. Experiments show that the algorithm is effective, with relatively high precision and recall.

Keywords: Internet; web page deduplication; fingerprint; edit distance

Classification: TP393.092 [Automation and Computer Technology: Computer Application Technology]
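
Illustration: the abstract describes obtaining page similarity from the edit distance between two fingerprint sequences. The sketch below shows one way that idea could look in Python. It is not the authors' implementation: the abstract does not say how fingerprints are produced or how distance is turned into a similarity score, so the per-paragraph SHA-256 fingerprint, the normalization by the longer sequence length, and the function names `fingerprint_sequence`, `levenshtein`, and `similarity` are all assumptions made for this example.

```python
import hashlib
from typing import List, Sequence


def fingerprint_sequence(text: str) -> List[str]:
    """Hash each non-empty line (treated as a paragraph), keeping order.

    Assumption: one fingerprint per paragraph; the paper may segment differently.
    """
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    return [hashlib.sha256(p.encode("utf-8")).hexdigest() for p in paragraphs]


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Standard dynamic-programming edit distance over sequence elements."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(page_a: str, page_b: str) -> float:
    """Similarity in [0, 1]; the normalization is an assumption of this sketch."""
    fa, fb = fingerprint_sequence(page_a), fingerprint_sequence(page_b)
    longest = max(len(fa), len(fb))
    if longest == 0:
        return 1.0  # two empty pages are treated as identical
    return 1.0 - levenshtein(fa, fb) / longest
```

Under these assumptions, two pages that differ in one of four paragraphs score 0.75, while two pages containing the same three paragraphs in reversed order score about 0.33. An order-insensitive comparison of fingerprint sets would treat the reordered pages as identical, which is the structural sensitivity the abstract attributes to the edit-distance approach.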

 
