Spark查询引擎中Join操作的优化  被引量:1

OPTIMIZATION OF JOIN OPERATION IN SPARK QUERY ENGINE

在线阅读下载全文

作  者:赵丽梅 黄小菊[1] 宫学庆[1] Zhao Limei;Huang Xiaoju;Gong Xueqing(School of Software Engineering,East China Normal University,Shanghai 200062,China)

机构地区:[1]华东师范大学软件工程学院,上海200062

出  处:《计算机应用与软件》2022年第8期44-50,共7页Computer Applications and Software

摘  要:Spark是基于Map/Reduce计算模型进行大规模数据处理的分布式系统,每个任务都会被分为很多Map处理和Reduce处理在各个节点上并行执行。Shuffle操作是用于连接Map处理和Reduce处理的桥梁。在对两个大表进行Join操作的过程中,如果两表Join列不完全匹配,Spark中现有的Join实现算法会对大量数据进行shuffle操作,严重影响执行效率。提出一种基于Semi Join思想的Join实现算法——Semi Sort Merge Join,通过对左表Join列数据所构建的HashMap对右表数据进行过滤,可以有效减少Shuffle操作过程中所需传输的数据量。算法分析和实验结果表明,对于Join列数据不完全匹配的大表间Join操作,该算法能有效减少Shuffle操作的开销,右表与左表匹配数据量越少,算法优化的效果越明显。Spark is a distributed system for large-scale data processing based on the Map/Reduce computing model.Each task will be divided into many Map processing and Reduce processing to be executed in parallel on each node.The Shuffle operation is a bridge for connecting Map processing and Reduce processing.During the join operation on two large tables,if the join columns of the two tables do not match exactly,the existing join implementation algorithm in Spark will perform a shuffle operation on a large amount of data,which seriously affects the execution efficiency.Based on the idea of Semi Join,this paper proposes a join implementation algorithm called Semi Sort Merge Join.By filtering the data in the right table through the HashMap constructed from the join data of the left table,it could effectively reduce the amount of data needed to be transferred during the shuffle operation.Algorithm analysis and experimental results show that the algorithm can effectively reduce the cost of shuffle operations for join operations between large tables where the data in the join column does not match exactly.The less the amount of matching data between the right table and the left table,the more obvious the optimization effect of the algorithm.

关 键 词:SPARK JOIN SHUFFLE Semi Join 

分 类 号:TP3[自动化与计算机技术—计算机科学与技术]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象