Aggregated Query Interval Estimation Method Based on Depth Generative Model

Authors: FANG Jun [1,2]; XUE Xiaodong; ZHOU Yunliang

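Affiliations: [1] School of Information, North China University of Technology, Beijing 100144, China; [2] Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100144, China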

Source: Computer Engineering, 2023, No. 11, pp. 284-292, 301 (10 pages in total)

Funding: National Natural Science Foundation of China, International (Regional) Cooperation and Exchange Program (62061136006).

Abstract: Most current approximate query methods answer a query with a single estimated value. Such point estimates are simple but inevitably carry error. Interval estimation methods must compute over a large number of samples, which causes high query latency and makes them difficult to apply widely in practice. Model-driven approximate query techniques have an efficiency advantage, but their results lack reliability guarantees. To address this, an approximate query method that combines data sampling with machine learning is proposed. A deep generative model improves query efficiency, and interval estimation replaces point estimation: the query results from multiple samples are used to produce a relatively reliable interval. First, an improved Generative Adversarial Network (GAN) model learns the data distribution, so that multiple samples can be generated quickly without accessing the dataset. A massively parallel processing architecture then distributes the computing tasks for sample generation and query execution, and finally the query result is returned to the user. Experimental results show that the Normalized Confidence Interval Coverage (NCIC) of the aggregate query interval estimates produced by the method exceeds 85%. In query experiments with the aggregate function COUNT and selectivity below 0.03, the method's NCIC on the ROAD and PM2.5 datasets is 13.9% and 14.8% higher, respectively, than that of random sampling. Although query latency increases relative to the baseline methods, it still meets the requirements of typical applications.
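
The pipeline described in the abstract (generate several samples from a model, run the aggregate query on each, and combine the per-sample answers into an interval) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: `generate_sample` is a hypothetical stand-in that draws from a fixed Gaussian instead of a trained GAN, a thread pool stands in for the massively parallel processing architecture, and the interval is taken as an empirical percentile range, which may differ from the construction used in the paper.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def generate_sample(seed, size):
    """Hypothetical stand-in for a trained generator.

    Assumption: a single numeric column modeled as Gaussian(50, 15).
    No access to the original dataset is needed to draw rows.
    """
    rng = random.Random(seed)
    return [rng.gauss(50.0, 15.0) for _ in range(size)]


def count_estimate(sample, predicate, table_rows):
    """Scale the COUNT observed on one sample up to the full table size."""
    hits = sum(1 for value in sample if predicate(value))
    return hits * table_rows / len(sample)


def interval_estimate(predicate, table_rows, num_samples=50, sample_size=2000,
                      confidence=0.90, workers=4):
    """Run the query on many generated samples and return (low, high)."""
    def one_run(seed):
        sample = generate_sample(seed, sample_size)
        return count_estimate(sample, predicate, table_rows)

    # Sample generation and query execution are distributed across workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        estimates = sorted(pool.map(one_run, range(num_samples)))

    # Empirical percentile interval over the per-sample estimates.
    tail = (1.0 - confidence) / 2.0
    low_index = int(tail * (num_samples - 1))
    high_index = int(round((1.0 - tail) * (num_samples - 1)))
    return estimates[low_index], estimates[high_index]


if __name__ == "__main__":
    low, high = interval_estimate(lambda v: v > 80.0, table_rows=1_000_000)
    print(f"COUNT(*) WHERE value > 80 is estimated in [{low:.0f}, {high:.0f}]")
```

The predicate `value > 80` is deliberately a low-selectivity query, the regime in which the abstract reports the largest gain over random sampling. Raising `num_samples` tightens the percentile estimate at the cost of more generation and query work, which is the latency trade-off the abstract mentions.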

Keywords: approximate query; generative model; parallel computing; interval estimation; sampling

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
