Distributed Join Algorithm under an Architecture Separating Baseline and Incremental Data (cited by: 6)

A Distributed Join Algorithm on Separated Data Storage

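Affiliation: [1] Institute for Data Science and Engineering, East China Normal University; Shanghai Key Laboratory of Trustworthy Computing, Shanghai 200062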

Source: Chinese Journal of Computers, 2016, No. 10, pp. 2102-2113 (12 pages)

Funding: Supported by the Key Program of the National Natural Science Foundation of China (61332006)

Abstract: In the big data era, the efficiency of table joins in database systems urgently needs improvement. For database systems that separate baseline data from incremental data, the join operation is the main performance bottleneck. To improve transaction processing performance, such systems typically store baseline data on disk and incremental data in memory, which yields high transaction throughput and scalability. HBase, BigTable, and OceanBase are typical database management systems with this architecture, but their joins are inefficient for two main reasons: the baseline and incremental data must be merged before every join, and the more complex storage model incurs excessive network overhead. This paper proposes an optimized sort-merge join algorithm for the separated baseline and incremental data architecture. The algorithm range-partitions the join attribute and performs sort-merge joins on multiple nodes in parallel. It does not need to merge baseline and incremental data before the join, so the two can be partitioned, sorted, and processed in parallel, and it avoids merging baseline and incremental data for the many tuples that do not appear in the join result. The algorithm is implemented in OceanBase, an open-source distributed database system, and a series of experiments shows that it improves the join performance of OceanBase by a large margin.
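The idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes unique join keys, represents each table as a baseline dict plus an incremental dict (with a hypothetical `DELETED` tombstone for deletions), and uses threads in place of nodes. All function names are illustrative. Each key range is joined independently, and the baseline and incremental versions of a row are merged only when its key appears in the join result.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor

DELETED = object()  # hypothetical tombstone: row deleted in the incremental data


def partition_of(key, bounds):
    """Index of the range partition for key; bounds are sorted split points."""
    return bisect_right(bounds, key)


def split_by_range(rows, bounds):
    """Distribute a {key: value} mapping into len(bounds) + 1 range partitions."""
    parts = [dict() for _ in range(len(bounds) + 1)]
    for key, value in rows.items():
        parts[partition_of(key, bounds)][key] = value
    return parts


def live_keys(baseline, delta):
    """Sorted keys visible after applying delta, without building merged rows."""
    keys = {k for k in baseline if delta.get(k) is not DELETED}
    keys.update(k for k, v in delta.items() if v is not DELETED)
    return sorted(keys)


def resolve(key, baseline, delta):
    """Merge the two versions of a single row; the incremental version wins."""
    return delta[key] if key in delta else baseline[key]


def join_partition(r_base, r_delta, s_base, s_delta):
    """Sort-merge join of one key range; rows are merged only on a key match."""
    r_keys = live_keys(r_base, r_delta)
    s_keys = live_keys(s_base, s_delta)
    out, i, j = [], 0, 0
    while i < len(r_keys) and j < len(s_keys):
        if r_keys[i] < s_keys[j]:
            i += 1
        elif r_keys[i] > s_keys[j]:
            j += 1
        else:
            k = r_keys[i]
            out.append((k, resolve(k, r_base, r_delta), resolve(k, s_base, s_delta)))
            i += 1
            j += 1
    return out


def distributed_join(r_base, r_delta, s_base, s_delta, bounds, workers=4):
    """Range-partition all four inputs, then join the partitions in parallel."""
    parts = [split_by_range(t, bounds) for t in (r_base, r_delta, s_base, s_delta)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(join_partition, *parts)
    # Partitions are returned in range order, so the output is sorted by key.
    return [row for part in results for row in part]
```

In this sketch, a row that is updated in the incremental data but has no partner in the other table is never merged with its baseline version, which mirrors the saving the abstract describes for tuples outside the result set.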

Keywords: distributed join; incremental data; parallel processing; sort-merge join

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
