Research on Fast Deduplication of Massive Web Pages Based on Counting Bloom Filter  Cited by: 1

Published English title: Research of massive web rapidly filter base on Counting Bloom Filter


Authors: WU Jiaqi (吴家奇)[1]; LIU Nianguo (刘年国)[1]; LI Xue (李雪)[1]; XIE Xiang (谢翔); WANG Tao (王涛) (State Grid Huainan Power Supply Company, Huainan 232007, Anhui, China)

Affiliation: [1] State Grid Huainan Power Supply Company, Huainan 232007, Anhui, China

Source: Power Systems and Big Data (《电力大数据》), 2018, No. 12, pp. 37-42 (6 pages)

Abstract: Web page deduplication is the process of detecting redundant pages in a large data set and removing them from that set, which effectively reduces the load on retrieval and storage. Three families of methods have seen substantial progress: URL-based deduplication of same-source pages, fingerprint extraction based on page structure and features, and clustering based on page content. For massive collections of web pages, however, these three approaches still struggle with the time and space costs of deduplication. Building on a web page deduplication algorithm that uses an MD5 fingerprint library, and exploiting the properties of the Counting Bloom filter, this paper proposes a space-saving strategy for representing large-scale data and deduplicating it quickly, and implements a fast deduplication algorithm, IMP-CM Filter, which greatly reduces the time and space complexity of web page deduplication. The algorithm improves the efficiency of deduplicating massive numbers of web pages by reducing frequent I/O operations. Experiments demonstrate the effectiveness of the IMP-CM Filter algorithm.

Keywords: web page deduplication; MD5 fingerprint library; Counting Bloom filter; IMP-CM Filter algorithm

Classification No.: TM744 [Electrical Engineering: Power Systems and Automation]
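The abstract names the building blocks (MD5 fingerprints and a Counting Bloom filter held in memory to avoid repeated I/O against a fingerprint library) but does not give the details of IMP-CM Filter. The following is only a minimal sketch of the general technique under stated assumptions, not the authors' algorithm: the class name `CountingBloomFilter`, the function `deduplicate`, the counter-array size, the number of hash positions, and the double-hashing scheme that derives positions from one MD5 digest are all illustrative choices that do not come from the paper.

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom filter: an array of counters instead of single bits,
    so that items can be removed as well as added."""

    def __init__(self, num_counters=1 << 16, num_hashes=4):
        # Sizes are illustrative assumptions, not values from the paper.
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.counters = [0] * num_counters

    def _positions(self, item: bytes):
        # Split one 128-bit MD5 digest into two 64-bit halves and combine
        # them (double hashing) to obtain num_hashes counter positions.
        digest = hashlib.md5(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_counters
                for i in range(self.num_hashes)]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1

    def __contains__(self, item: bytes) -> bool:
        # True means "probably present" (false positives are possible);
        # False means "definitely not present".
        return all(self.counters[pos] > 0 for pos in self._positions(item))

    def remove(self, item: bytes) -> None:
        # Only decrement when the item appears to be present, so counters
        # never go negative.
        if item in self:
            for pos in self._positions(item):
                self.counters[pos] -= 1


def deduplicate(pages, cbf=None):
    """Keep the first page for each distinct content string.

    `pages` is an iterable of (url, content) pairs. Membership is checked
    in memory against the filter, with no lookup in an on-disk
    fingerprint library.
    """
    if cbf is None:
        cbf = CountingBloomFilter()
    unique = []
    for url, content in pages:
        key = content.encode("utf-8")
        if key in cbf:
            continue  # probably a duplicate: skip it
        cbf.add(key)
        unique.append((url, content))
    return unique
```

Because a Bloom filter can report false positives, a sketch like this may occasionally discard a page that is not actually a duplicate; a system that cannot tolerate this would confirm a positive answer against the stored fingerprints before dropping the page. The counters, unlike plain bits, allow a page's fingerprint to be removed again when the page is deleted from the collection.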

 
