An improved Bloom filter algorithm under Hadoop for duplicate web page removal

Cited by: 1

Source: Computer Engineering & Science (《计算机工程与科学》), 2017, No. 2, pp. 285-290 (6 pages)

Funding: Natural Science Foundation of Hebei Province (F2015402077); Hebei Province Key Basic Research Project (14964206D)

Abstract: Large volumes of duplicate and near-duplicate data stored on servers waste storage space. To address this problem, we propose an improved Bloom filter algorithm that adds a bit array and dynamically optimizes the number of copies kept of duplicated data, using a weight computed from the number of repeated hits on the bit array. The improved algorithm is then parallelized on a Hadoop distributed cluster to further improve job processing efficiency. Experimental results show that, compared with traditional web page deduplication algorithms, the parallel implementation of the improved Bloom filter not only improves job processing efficiency but also, by optimizing the number of copies according to the dynamic repeat counts in the bit array, saves server storage space to a certain extent.

Keywords: Hadoop; Bloom filter; number of copies; MapReduce

Classification: TP301 [Automation and Computer Technology: Computer System Architecture]
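
The idea summarized in the abstract, a Bloom filter extended with an extra array that records repeated hits, which then drives the number of copies kept, can be illustrated with a minimal single-machine sketch. This is not the authors' implementation: the class `CountingHitBloomFilter`, the function `replica_count`, the SHA-256 based hashing, and the weight rule (one extra copy per `step` repeated hits, capped at `max_copies`) are all hypothetical choices made for illustration, since the abstract does not give the actual formula or the Hadoop job layout.

```python
import hashlib


class CountingHitBloomFilter:
    """Bloom filter plus a parallel array counting repeated hits per slot.

    Hypothetical sketch; the paper's actual data layout is not given.
    """

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits  # standard Bloom filter bit array
        self.hits = [0] * num_bits      # added array: repeated-hit counts

    def _positions(self, item):
        # Derive the slot positions by salting SHA-256 with the hash index.
        # A set is used so that colliding hashes count a slot only once.
        positions = set()
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            positions.add(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def add(self, item):
        """Insert item; return True if it was (probably) seen before."""
        positions = self._positions(item)
        seen = all(self.bits[p] for p in positions)
        for p in positions:
            if seen:
                self.hits[p] += 1  # record a repeated hit on this slot
            self.bits[p] = True
        return seen

    def repeat_count(self, item):
        """Estimate repeated insertions as the minimum over the item's slots.

        Like membership itself, this can overestimate when items collide.
        """
        return min(self.hits[p] for p in self._positions(item))


def replica_count(repeats, min_copies=1, max_copies=3, step=5):
    """Hypothetical weight rule: one extra copy per `step` repeated hits."""
    return min(max_copies, min_copies + repeats // step)


# Usage: feed page URLs (or content fingerprints) through the filter,
# then derive a copy count for a page from its repeated-hit estimate.
bf = CountingHitBloomFilter()
for url in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
    bf.add(url)
copies = replica_count(bf.repeat_count("http://example.com/a"))
```

In a Hadoop setting, one plausible arrangement (again an assumption, not taken from the abstract) would have each mapper build such a filter over its input split and a reducer merge them by OR-ing the bit arrays and summing the hit arrays.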
