面向多表连接查询优化的基数估计方法

Cardinality Estimation Method for Multitable JOIN Query Optimization

作　　者：钱文渊荆一楠[1] 王晓阳[1] 吴振环 QIAN Wenyuan;JING Yinan;WANG Xiaoyang;WU Zhenhuan(School of Computer Science,Fudan University,Shanghai 200433,China;Information Center,Ministry of Justice,Beijing 100020,China)

机构地区：[1]复旦大学计算机科学技术学院,上海200433 [2]司法部信息中心,北京100020

出　　处：《计算机工程》2022年第6期167-173,共7页Computer Engineering

基　　金：国家自然科学基金(61732004)。

摘　　要：基数估计是实现数据库多表连接(JOIN)查询优化的重要手段之一。对数据量较大的数据表进行基数估计时常用数据抽样来获得较小的样本,从而估计各种查询负载下所需的数据基数。在单表上利用数据抽样来完成基数估计的方法已经得到广泛研究,但在多个数据表的抽样样本总体存储预算存在限制时,目前仍缺乏有效的多表间样本数划分方法使得整体基数估计达到较优。为此,提出一种面向多表JOIN查询优化的基数估计方法,针对一组给定的含有复杂多JOIN操作的查询负载,为其合理分配数据库中每个表的抽样率,从而在满足样本大小总和限制的同时使得基数估计准确率达到最高。将上述过程抽象为一个抽样率分配搜索问题,在数据库数据抽样问题中引入贝叶斯优化搜索算法,利用该算法快速搜索出不同表之间抽样样本大小的分配比例,使得有限时间内获得的样本分配方案对应的基数估计准确率最高,从而达到查询优化的目的。在TPC-H数据集上的实验结果表明,在相同时间内确定多JOIN操作查询负载下基数估计准确率最高的抽样比例方案时,相比随机搜索算法,贝叶斯优化算法所得方案对应的基数估计误差率降低54.8%~60.2%。Cardinality estimation is one of the important means to optimize database multitable JOIN queries.When the cardinality of a data table with a large number of data is estimated,data sampling is often used to obtain smaller samples to estimate the data cardinality required under various query loads.The method of using data sampling to complete cardinality estimation on a single table has been widely studied;however,when there are restrictions on the overall storage budget of sampling samples in multiple data tables,an effective sample number division method between multiple tables to improve the overall cardinality estimation is lacking.Therefore,a cardinality estimation method for multitable JOIN query optimization is proposed.For a given set of query loads with complex multiple JOIN operations,the sampling rate of each table in the database is reasonably allocated to maximize the accuracy of cardinality estimation while meeting the limit of the sum of sample sizes.The above process is abstracted as a sampling rate allocation search problem,and the Bayesian optimization search algorithm is introduced into the database data sampling problem.This algorithm is used to search the allocation proportion of sampling sample size between different tables quickly,so that the cardinality estimation accuracy corresponding to the sample formula obtained in a limited time is the highest,thereby achieving query optimization.The experimental results on the TPC-H dataset show that,when determining the sampling proportion scheme with the highest cardinality estimation accuracy under the query load of multiple JOIN operations in the same time,compared with the random search algorithm,the cardinality estimation error rate corresponding to the scheme obtained by Bayesian optimization algorithm is reduced by 54.8% to 60.2%.

关键词：多表连接查询优化基数估计数据抽样贝叶斯优化

分类号：TP311[自动化与计算机技术—计算机软件与理论]

参考文献：

正在载入数据...

二级参考文献：

正在载入数据...

耦合文献：

正在载入数据...

引证文献：

正在载入数据...

二级引证文献：

正在载入数据...

同被引文献：

正在载入数据...

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

面向多表连接查询优化的基数估计方法

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

高级检索检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

面向多表连接查询优化的基数估计方法

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

用户登录

高级检索检索式检索