Affiliation: [1] School of Control and Computer Engineering, North China Electric Power University, Beijing 102206
Source: Computer Engineering and Design (《计算机工程与设计》), 2017, No. 7, pp. 1839-1843 (5 pages)
Abstract: To address frequent itemset mining on large-scale, high-dimensional data, this work optimizes the time and space complexity and the parallelization strategy of traditional algorithms, and implements an improved maximal frequent itemset mining algorithm on Spark. Combining Spark's distributed framework with the strengths of the DMFIA algorithm, it proposes two improvements: depth path search and length-first superset checking. A recursive depth path search generates the candidate set of maximal frequent itemsets in a single pass; the candidates are then sorted by length and checked for supersets. This reduces both the size of the candidate set and the number of mining passes, addressing the low efficiency of traditional maximal frequent itemset mining algorithms when the data volume is large and the dimensionality is high. Experimental results show that the algorithm is 2-4 times faster than comparable algorithms and scales well with dataset size.
Keywords: frequent pattern tree; distributed computing; data mining; association rules; maximal frequent itemsets
Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
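
The length-first superset check named in the abstract can be illustrated with a minimal single-machine sketch. This is not the paper's implementation: it does not use Spark or DMFIA, the function name `maximal_itemsets` is hypothetical, and it assumes every candidate passed in has already been found frequent. It only shows why sorting candidates longest-first lets each itemset be tested against the maximal sets kept so far.

```python
from typing import FrozenSet, Iterable, List


def maximal_itemsets(candidates: Iterable[Iterable[str]]) -> List[FrozenSet[str]]:
    """Keep only candidates that are not proper subsets of another candidate.

    Assumption: every candidate is already known to be frequent.
    """
    # Remove duplicate candidates so equal sets are never compared.
    unique = {frozenset(c) for c in candidates}
    # Length-first ordering: longest itemsets first; ties broken
    # alphabetically so the result is deterministic.
    ordered = sorted(unique, key=lambda s: (-len(s), sorted(s)))

    maximal: List[FrozenSet[str]] = []
    for itemset in ordered:
        # Any superset of `itemset` is longer, so it was already
        # processed and is either kept or covered by a kept set.
        if not any(itemset < kept for kept in maximal):
            maximal.append(itemset)
    return maximal


if __name__ == "__main__":
    example = [{"a", "b"}, {"a", "b", "c"}, {"c"}, {"d", "e"}, {"e"}]
    for itemset in maximal_itemsets(example):
        print(sorted(itemset))
```

Because the candidates are visited longest-first, a shorter itemset only needs to be compared with the maximal sets already kept, never with the full candidate list.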