Accelerating Distributed OLAP with Graph Structure Indexing


Authors: SHEN Si-Jie; CHEN Rong[1]; CHEN Hai-Bo[1]; ZANG Bin-Yu[1]

Affiliation: [1] Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University, Shanghai 200240, China

Source: Journal of Software (《软件学报》), 2023, No. 10, pp. 4661-4680 (20 pages)

Funding: National Natural Science Foundation of China, General Program (61772335).

Abstract: As the scale of business data grows, important application scenarios such as business intelligence (BI), enterprise resource planning (ERP), and user behavior analysis rely on distributed online analytical processing (OLAP) to analyze large-scale data. Distributed OLAP also removes the storage limit of a single machine, so data can be kept in memory to improve OLAP performance. However, once in-memory distributed OLAP has eliminated disk input/output (I/O), the performance bottleneck shifts to the join operation. The join is a common operation in OLAP and involves a large number of data accesses and computations. Based on an analysis of several existing join methods, this study proposes a graph structure index that accelerates the join operation, together with LinkJoin, a join method built on that index. According to the join relationships specified by users, the graph structure index stores the in-memory positions of data in the form of a graph. A join based on the graph structure index has a low complexity equivalent to that of the hash join, and it also reduces the number of data accesses and computations performed during execution. This study extends MonetDB, a state-of-the-art open-source in-memory OLAP system, from a single-machine system to a distributed one, and designs and implements the join method based on graph structure indexing on top of it. A series of designs and optimizations targets three important aspects of the system (the graph index structure, columnar storage, and the distributed execution engine) to improve its distributed OLAP performance. Test results show that, in the TPC-H benchmark, the join based on graph structure indexing improves the performance of queries that contain join operations by 1.64 times on average and by up to 4.1 times. For the join operations within these queries, the performance improvement is 9.8-22.1 times.
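The abstract states only that the graph structure index stores the in-memory positions of joinable data as a graph, and that the join then needs fewer data accesses and computations than a hash join. The following Python sketch illustrates that general idea under stated assumptions; it is not the authors' LinkJoin implementation. The table contents, the list-of-positions edge layout, and the function names `build_link_index`, `link_join`, and `hash_join` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical columnar tables: each column is a list, a row is a position.
orders = {"o_orderkey": [10, 11, 12], "o_status": ["O", "F", "O"]}
lineitem = {"l_orderkey": [11, 10, 10, 12, 11],
            "l_price": [5.0, 7.5, 1.0, 3.0, 2.0]}


def hash_join(left_keys, right_keys):
    """Baseline: build a hash table on the left keys, probe it per right row."""
    table = defaultdict(list)
    for lpos, key in enumerate(left_keys):
        table[key].append(lpos)
    return sorted((lpos, rpos)
                  for rpos, key in enumerate(right_keys)
                  for lpos in table.get(key, []))


def build_link_index(left_keys, right_keys):
    """Built once, before queries, for a user-declared join relationship.

    edges[lpos] lists the positions of the right-table rows that join with
    the left-table row at position lpos (an assumed edge layout).
    """
    pos_of = defaultdict(list)
    for rpos, key in enumerate(right_keys):
        pos_of[key].append(rpos)
    return [pos_of.get(key, []) for key in left_keys]


def link_join(edges, selected_left_positions):
    """Query time: follow stored edges; no key is read, hashed, or compared."""
    return sorted((lpos, rpos)
                  for lpos in selected_left_positions
                  for rpos in edges[lpos])


edges = build_link_index(orders["o_orderkey"], lineitem["l_orderkey"])
full_join = link_join(edges, range(len(orders["o_orderkey"])))
```

In this sketch the join columns are touched only while the index is built. At query time `link_join` walks the precomputed positions, which is one plausible reading of how the cost stays comparable to a hash join while the per-query reads and computations drop.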

Keywords: OLAP system; distributed system; join operation; indexing technique; graph structure

Classification code: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
