Authors: WU Jiaqi[1]; LIU Nianguo[1]; LI Xue[1]; XIE Xiang; WANG Tao (State Grid Huainan Power Supply Company, Huainan 232007, Anhui, China)
Source: Power Systems and Big Data (《电力大数据》), 2018, No. 12, pp. 37-42 (6 pages)
Abstract: Web page deduplication is the process of detecting redundant pages in a large data collection and removing them from that collection, which effectively reduces the burden on retrieval and storage. Three families of methods have seen considerable development: URL-based deduplication of same-source pages, fingerprint extraction based on page structure and features, and clustering based on page content. For massive collections of web pages, however, none of these three approaches yet resolves the time and space costs of deduplication. Building on the MD5 fingerprint library deduplication algorithm and exploiting the properties of the Counting Bloom filter, this paper proposes a space-saving representation for large-scale data together with a fast deduplication strategy, and implements a fast deduplication algorithm, IMP-CM Filter, which greatly reduces the time and space complexity of web page deduplication. The algorithm improves the efficiency of deduplicating massive numbers of web pages by reducing frequent I/O operations. Finally, experiments demonstrate the effectiveness of the IMP-CM Filter algorithm.
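Keywords: web page deduplication; MD5 fingerprint library; Counting Bloom filter; IMP-CM Filter algorithm
Classification: TM744 [Electrical Engineering: Power Systems and Automation]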
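The abstract combines MD5 fingerprints with a Counting Bloom filter but does not give the details of IMP-CM Filter itself. The following is only a minimal sketch of the two named building blocks, not the authors' algorithm. The class name `CountingBloomFilter`, the helpers `page_fingerprint` and `deduplicate`, the counter width, the table size, and the double-hashing scheme are all assumptions made for illustration.

```python
import hashlib


class CountingBloomFilter:
    """Illustrative counting Bloom filter (hypothetical, not IMP-CM Filter).

    Each slot holds a small counter instead of a single bit, so a
    fingerprint that was added earlier can also be removed.
    """

    def __init__(self, num_counters: int = 1 << 16, num_hashes: int = 4):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        # One byte per counter; counters saturate at 255 (an assumption).
        self.counters = bytearray(num_counters)

    def _positions(self, fingerprint: bytes):
        # Derive k slot indices from the 16-byte MD5 digest by double hashing.
        h1 = int.from_bytes(fingerprint[:8], "big")
        h2 = int.from_bytes(fingerprint[8:], "big") | 1
        return [(h1 + i * h2) % self.num_counters
                for i in range(self.num_hashes)]

    def add(self, fingerprint: bytes) -> None:
        for pos in self._positions(fingerprint):
            if self.counters[pos] < 255:
                self.counters[pos] += 1

    def remove(self, fingerprint: bytes) -> None:
        # Only meaningful for fingerprints that were previously added.
        for pos in self._positions(fingerprint):
            if self.counters[pos] > 0:
                self.counters[pos] -= 1

    def __contains__(self, fingerprint: bytes) -> bool:
        # True means "probably seen"; false positives are possible.
        return all(self.counters[pos] > 0
                   for pos in self._positions(fingerprint))


def page_fingerprint(content: str) -> bytes:
    """MD5 digest of the page content, used as its fingerprint."""
    return hashlib.md5(content.encode("utf-8")).digest()


def deduplicate(pages):
    """Return the URLs of pages whose content was not seen before.

    `pages` is an iterable of (url, content) pairs. Membership is tested
    in memory, so no on-disk fingerprint lookup is needed per page.
    """
    seen = CountingBloomFilter()
    unique_urls = []
    for url, content in pages:
        fp = page_fingerprint(content)
        if fp in seen:
            continue  # probably a duplicate
        seen.add(fp)
        unique_urls.append(url)
    return unique_urls
```

Calling `deduplicate` on a list of `(url, content)` pairs keeps the first URL for each distinct page body. Because the filter lives in memory, each membership test avoids a lookup against a fingerprint library on disk, which is the kind of I/O saving the abstract refers to. The cost is a small false-positive rate that depends on the table size and the number of hash positions.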