检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:冯禹洪[1] 吴坤汉 黄志鸿 冯洋洲 陈欢欢[2] 白鉴聪[1] 明仲[1] Feng Yuhong;Wu Kunhan;Huang Zhihong;Feng Yangzhou;Chen Huanhuan;Bai Jiancong;Ming Zhong(College of Computer Science and Software Engineering,Shenzhen University,Shenzhen,Guangdong 518060;School of Computer Science and Technology,University of Science and Technology of China,Hefei 230027)
机构地区:[1]深圳大学计算机与软件学院,广东深圳518060 [2]中国科学技术大学计算机科学与技术学院,合肥230027
出 处:《计算机研究与发展》2023年第12期2890-2906,共17页Journal of Computer Research and Development
基 金:国家自然科学基金项目(62272315,61836005,62002230);深圳市基础研究面上项目(JCYJ20210324093212034);广东省自然科学基金项目(2019A1515012053);腾讯“犀牛鸟”-深圳大学青年教师科研基金项目;深圳市稳定支持计划项目(20200814105901001)。
摘 要:利用集合相似度自连接算法找出一个集合集中所有相似度大于给定阈值的集合对有着广泛的应用.基于过滤-验证框架和并行分布式计算框架MapReduce的集合相似度连接是近年来的研究热点.但现有算法在阈值低时产生较大规模的候选集,导致性能不理想.针对这一问题,提出采用频繁模式树FP-tree及其派生结构FP-tree*将数据压缩在内存中计算集合相似度自连接以减小候选集规模.首先设计并讨论基于现有FP-tree*的集合相似度连接计算及其优缺点,提出遍历效率更高的线性频繁模式树结构模型TELP-tree及基于它的算法TELP-SJ(TELP-tree self join),其包括分别面向构建树和遍历树的2阶段过滤算法,这些算法可以减小树规模和减少树遍历.然后,设计基于MapReduce的并行分布式算法FastTELP-SJ.最后,基于4组真实应用数据集进行3组性能比较实验.实验结果表明FastTELP-SJ算法面向高维大规模集合相似度自连接计算时,包括执行时间、内存占用率、磁盘使用量和可扩展性的运行效率最好.Given a collection of sets,the set similarity self-join(SSSJ)algorithm finds all pairs of sets whose similarity is higher than a given threshold,which has been widely used in various applications.The SSSJ adopting the filter-and-verification framework and the parallel and distributed computing framework MapReduce is an active research topic.But existing algorithms produce large candidate sets when the given threshold is low and lead to poor performance.To address this problem,we propose to compute the SSSJ with frequent pattern tree structures and their derivatives FP-tree*,which are used to compress data in memory for computation,targeting at reducing the size of candidate sets.First,based on an investigation on existing FP-tree*and the characteristics of SSSJ computation,we propose a more traversal efficient linear prefix tree structure,i.e.,TELP-tree,and an original SSSJ algorithm accordingly,i.e.,TELP-SJ,which includes a two-phase filtering strategy:tree-construction oriented filtering algorithm and tree-traversal oriented filtering strategy.These algorithms will reduce the size of TELP-trees and tree-traversals.Then,we design the parallel and distributed SSSJ computation algorithm FastTELP-SJ with MapReduce.Finally,4 groups of empirical comparison studies have been carried out on 4 sets of real application datasets.The experimental results demonstrate that FastTELP-SJ achieves better performance in term of the execution time,memory usage,disk usage and scalability for similarity joins over large scale collections of sets with high dimension.
关 键 词:相似度连接 FP树 MAPREDUCE框架 Jaccard函数 集合
分 类 号:TP391[自动化与计算机技术—计算机应用技术]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:3.15.190.49