Novel MapReduce-Based Similarity Self-Join Method: Filter and In-Circle Algorithm

Authors: Bao Guanghui, Zhang Zhaogong [1], Li Jianzhong [1,2], Xuan Ping [1]

Affiliations: [1] School of Computer Science and Technology, Heilongjiang University, Harbin 150080; [2] School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001

Source: Journal of Computer Research and Development (计算机研究与发展), 2016, No. 12, pp. 2847-2857 (11 pages)

Funding: National Basic Research Program of China (973 Program) (2012CB316200); National Natural Science Foundation of China (61302139)

Abstract: Similarity self-join is an important problem in many application domains. For massive datasets, MapReduce provides an effective distributed computing framework, and similarity self-join can be carried out on it. Existing work still has shortcomings: for example, clustered data regions are handled by refined partitioning to balance load, which is hard to implement, and existing algorithms cannot complete similarity self-joins over massive datasets efficiently. This paper proposes two novel MapReduce-based similarity self-join algorithms. The ideas are to use coordinate filtering to form an effective candidate set, and to apply an in-circle algorithm with hexagonal partitioning to clustered regions. The filtering technique builds on an equal-width grid partition and compares the coordinate difference between two points in the same dimension with the similarity threshold ε, which markedly reduces the number of candidates. The paper also proves that hexagonal partitioning is the optimal partition among all full coverings by regular polygons. Experimental results show that the new method is more efficient than other algorithms, improving efficiency by more than 80%, and that it effectively solves the similarity self-join problem for massive datasets with clustered regions.
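The coordinate-filtering idea in the abstract can be illustrated with a small single-machine sketch. It is not the paper's MapReduce implementation, which the abstract does not spell out. It assumes Euclidean distance and a grid whose cell width equals ε, and the function names (`cell_of`, `passes_coordinate_filter`, `similarity_self_join`) are hypothetical. The filter relies on the fact that the distance between two points is never smaller than their coordinate difference in any single dimension, so a pair whose difference in some dimension exceeds ε can be discarded without computing the full distance.

```python
import math
from collections import defaultdict
from itertools import product


def cell_of(point, eps):
    """Map a point to its equal-width grid cell (cell width = eps, an assumption)."""
    return tuple(math.floor(c / eps) for c in point)


def passes_coordinate_filter(p, q, eps):
    """Keep a pair only if every per-dimension difference is at most eps.

    If any single coordinate differs by more than eps, the Euclidean
    distance must also exceed eps, so the pair can be dropped early.
    """
    return all(abs(a - b) <= eps for a, b in zip(p, q))


def similarity_self_join(points, eps):
    """Return index pairs (i, j), i < j, whose Euclidean distance is <= eps."""
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[cell_of(p, eps)].append(i)

    dim = len(points[0]) if points else 0
    # Points within eps of each other lie in the same or an adjacent cell.
    offsets = list(product((-1, 0, 1), repeat=dim))

    result = set()
    for cell, members in grid.items():
        for off in offsets:
            neighbor = tuple(c + o for c, o in zip(cell, off))
            for i in members:
                for j in grid.get(neighbor, ()):
                    if i >= j:
                        continue  # report each unordered pair once
                    if not passes_coordinate_filter(points[i], points[j], eps):
                        continue  # cheap rejection before the exact check
                    if math.dist(points[i], points[j]) <= eps:
                        result.add((i, j))
    return result


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.5, 0.5), (3.0, 3.0), (3.0, 3.9), (0.9, 0.9)]
    print(sorted(similarity_self_join(pts, 1.0)))
```

In a MapReduce setting, the grid cell would plausibly serve as the shuffle key, with the filter and the exact distance check running on the reduce side; that mapping is an assumption here, not something the abstract states.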

Keywords: massive datasets; filtering; similarity self-join; data partitioning; Hadoop platform; MapReduce programming model

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]
