基于FP-tree和MapReduce的集合相似度自连接算法被引量：1

A Set Similarity Self-Join Algorithm with FP-tree and MapReduce

作　　者：冯禹洪[1] 吴坤汉黄志鸿冯洋洲陈欢欢[2] 白鉴聪[1] 明仲[1] Feng Yuhong;Wu Kunhan;Huang Zhihong;Feng Yangzhou;Chen Huanhuan;Bai Jiancong;Ming Zhong(College of Computer Science and Software Engineering,Shenzhen University,Shenzhen,Guangdong 518060;School of Computer Science and Technology,University of Science and Technology of China,Hefei 230027)

机构地区：[1]深圳大学计算机与软件学院,广东深圳518060 [2]中国科学技术大学计算机科学与技术学院,合肥230027

出　　处：《计算机研究与发展》2023年第12期2890-2906,共17页Journal of Computer Research and Development

基　　金：国家自然科学基金项目(62272315,61836005,62002230);深圳市基础研究面上项目(JCYJ20210324093212034);广东省自然科学基金项目(2019A1515012053);腾讯“犀牛鸟”-深圳大学青年教师科研基金项目;深圳市稳定支持计划项目(20200814105901001)。

摘　　要：利用集合相似度自连接算法找出一个集合集中所有相似度大于给定阈值的集合对有着广泛的应用.基于过滤-验证框架和并行分布式计算框架MapReduce的集合相似度连接是近年来的研究热点.但现有算法在阈值低时产生较大规模的候选集,导致性能不理想.针对这一问题,提出采用频繁模式树FP-tree及其派生结构FP-tree*将数据压缩在内存中计算集合相似度自连接以减小候选集规模.首先设计并讨论基于现有FP-tree*的集合相似度连接计算及其优缺点,提出遍历效率更高的线性频繁模式树结构模型TELP-tree及基于它的算法TELP-SJ(TELP-tree self join),其包括分别面向构建树和遍历树的2阶段过滤算法,这些算法可以减小树规模和减少树遍历.然后,设计基于MapReduce的并行分布式算法FastTELP-SJ.最后,基于4组真实应用数据集进行3组性能比较实验.实验结果表明FastTELP-SJ算法面向高维大规模集合相似度自连接计算时,包括执行时间、内存占用率、磁盘使用量和可扩展性的运行效率最好.Given a collection of sets,the set similarity self-join(SSSJ)algorithm finds all pairs of sets whose similarity is higher than a given threshold,which has been widely used in various applications.The SSSJ adopting the filter-and-verification framework and the parallel and distributed computing framework MapReduce is an active research topic.But existing algorithms produce large candidate sets when the given threshold is low and lead to poor performance.To address this problem,we propose to compute the SSSJ with frequent pattern tree structures and their derivatives FP-tree*,which are used to compress data in memory for computation,targeting at reducing the size of candidate sets.First,based on an investigation on existing FP-tree*and the characteristics of SSSJ computation,we propose a more traversal efficient linear prefix tree structure,i.e.,TELP-tree,and an original SSSJ algorithm accordingly,i.e.,TELP-SJ,which includes a two-phase filtering strategy:tree-construction oriented filtering algorithm and tree-traversal oriented filtering strategy.These algorithms will reduce the size of TELP-trees and tree-traversals.Then,we design the parallel and distributed SSSJ computation algorithm FastTELP-SJ with MapReduce.Finally,4 groups of empirical comparison studies have been carried out on 4 sets of real application datasets.The experimental results demonstrate that FastTELP-SJ achieves better performance in term of the execution time,memory usage,disk usage and scalability for similarity joins over large scale collections of sets with high dimension.

关键词：相似度连接 FP树 MAPREDUCE框架 Jaccard函数集合

分类号：TP391[自动化与计算机技术—计算机应用技术]

参考文献：

正在载入数据...

二级参考文献：

正在载入数据...

耦合文献：

正在载入数据...

引证文献：

正在载入数据...

二级引证文献：

正在载入数据...

同被引文献：

正在载入数据...

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

基于FP-tree和MapReduce的集合相似度自连接算法被引量：1

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

高级检索检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

基于FP-tree和MapReduce的集合相似度自连接算法 被引量：1

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

用户登录

高级检索检索式检索

基于FP-tree和MapReduce的集合相似度自连接算法被引量：1