检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:赵丽梅 黄小菊[1] 宫学庆[1] Zhao Limei;Huang Xiaoju;Gong Xueqing(School of Software Engineering,East China Normal University,Shanghai 200062,China)
出 处:《计算机应用与软件》2022年第8期44-50,共7页Computer Applications and Software
摘 要:Spark是基于Map/Reduce计算模型进行大规模数据处理的分布式系统,每个任务都会被分为很多Map处理和Reduce处理在各个节点上并行执行。Shuffle操作是用于连接Map处理和Reduce处理的桥梁。在对两个大表进行Join操作的过程中,如果两表Join列不完全匹配,Spark中现有的Join实现算法会对大量数据进行shuffle操作,严重影响执行效率。提出一种基于Semi Join思想的Join实现算法——Semi Sort Merge Join,通过对左表Join列数据所构建的HashMap对右表数据进行过滤,可以有效减少Shuffle操作过程中所需传输的数据量。算法分析和实验结果表明,对于Join列数据不完全匹配的大表间Join操作,该算法能有效减少Shuffle操作的开销,右表与左表匹配数据量越少,算法优化的效果越明显。Spark is a distributed system for large-scale data processing based on the Map/Reduce computing model.Each task will be divided into many Map processing and Reduce processing to be executed in parallel on each node.The Shuffle operation is a bridge for connecting Map processing and Reduce processing.During the join operation on two large tables,if the join columns of the two tables do not match exactly,the existing join implementation algorithm in Spark will perform a shuffle operation on a large amount of data,which seriously affects the execution efficiency.Based on the idea of Semi Join,this paper proposes a join implementation algorithm called Semi Sort Merge Join.By filtering the data in the right table through the HashMap constructed from the join data of the left table,it could effectively reduce the amount of data needed to be transferred during the shuffle operation.Algorithm analysis and experimental results show that the algorithm can effectively reduce the cost of shuffle operations for join operations between large tables where the data in the join column does not match exactly.The less the amount of matching data between the right table and the left table,the more obvious the optimization effect of the algorithm.
关 键 词:SPARK JOIN SHUFFLE Semi Join
分 类 号:TP3[自动化与计算机技术—计算机科学与技术]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:216.73.216.15