Duplicate Web Page Elimination Based on HTML Tags and Long-Sentence Extraction (cited by: 2)


Authors: 刘四维 [1], 章轶 [1], 夏勇明 [1], 钱松荣 [1]

Affiliation: [1] Department of Communication Engineering, Fudan University

Source: Microcomputer Applications (《微型电脑应用》), 2009, Issue 8, pp. 30-32, 5 (3 pages)

Abstract: We propose an efficient algorithm for eliminating duplicate web pages on the Internet. The algorithm uses HTML tags to filter out the noise in a page, and then extracts the long sentences that characterize the page as its features. Two pages are judged to be duplicates by analyzing the number of long sentences they share. The algorithm also indexes the long sentences with a red-black tree, turning deduplication into a long-sentence search and reducing the running time. Experimental results show that the algorithm eliminates duplicate web pages efficiently and accurately.
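The pipeline the abstract describes (tag-based noise filtering, long-sentence extraction, shared-sentence comparison) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the sentence-length threshold and similarity ratio are assumed values not given in the abstract, and Python's built-in `set` stands in for the paper's red-black tree index of long sentences.

```python
import re

# Assumed parameters -- the paper does not specify these values.
LONG_SENTENCE_MIN = 20   # minimum character length for a "long" sentence
DUPLICATE_RATIO = 0.8    # fraction of shared long sentences to call pages duplicates

NOISE_BLOCKS = re.compile(r"<(script|style)[^>]*>.*?</\1>", re.S | re.I)
TAGS = re.compile(r"<[^>]+>")

def extract_long_sentences(html):
    """Filter HTML noise, then keep sentences above the length threshold
    as the page's feature set (a set stands in for the red-black tree index)."""
    text = TAGS.sub(" ", NOISE_BLOCKS.sub(" ", html))
    sentences = re.split(r"[。！？.!?]", text)
    return {s.strip() for s in sentences if len(s.strip()) >= LONG_SENTENCE_MIN}

def is_duplicate(page_a, page_b, ratio=DUPLICATE_RATIO):
    """Judge duplication by the number of long sentences two pages share."""
    a, b = extract_long_sentences(page_a), extract_long_sentences(page_b)
    if not a or not b:
        return False
    shared = len(a & b)
    return shared / min(len(a), len(b)) >= ratio
```

Looking up each of one page's long sentences in the other page's index is what turns deduplication into a search problem; a balanced tree gives O(log n) per lookup, and a hash set (as here) gives expected O(1).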

Keywords: web page deduplication, page noise filtering, long sentence, red-black tree

Classification: TP393 [Automation and Computer Technology - Computer Application Technology]

 
