Authors: QIAO Yimeng (乔艺萌); JING Yinan (荆一楠) [2]; ZHANG Hanbing (张寒冰)
Affiliations: [1] School of Software, Fudan University, Shanghai 200441, China; [2] School of Computer Science, Fudan University, Shanghai 200433, China
Source: Computer Engineering (《计算机工程》), 2024, No. 1, pp. 30-38 (9 pages)
Funding: National Natural Science Foundation of China (62072113).
Abstract: Exact queries on large-scale datasets take a long time to execute, so Approximate Query Processing (AQP) techniques are often used in online analytical processing to return results with short interactive latency while keeping query error as low as possible. Existing learning-based AQP methods are decoupled from the underlying data and convert I/O-intensive computation into CPU-intensive computation. However, because computing resources are limited, these methods usually train their models on random data samples. Rare groups can be missing from such training data, which lowers the prediction accuracy of the model. To address this problem, the paper proposes a hybrid sum-product network model learned from stratified samples, the Stratified Sampling-based Sum-Product Network (SSSPN), and designs an AQP framework based on this model. Stratified samples effectively avoid the loss of rare groups, and a model trained on them is substantially more accurate. In addition, for dynamically updated data, the paper proposes an adaptive model-update strategy that lets the model detect data shift promptly and update itself adaptively. Experimental results show that, compared with sampling-based and machine-learning-based AQP methods, the model reduces the average relative error by approximately 18.3% on real datasets and 2.2% on synthetic datasets. In scenarios where data are dynamically updated, both its accuracy and its query latency remain stable.
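Keywords: approximate query processing; sum-product network; stratified sampling; data shift; adaptive update
Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]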
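Note: The abstract's point about rare groups can be illustrated with a minimal sketch. This is not the paper's SSSPN model or its sampling procedure; it only shows, under assumed parameters, how a per-group sampling floor keeps a rare group in the training sample. The function name `stratified_sample`, the `min_per_group` floor, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, rate, min_per_group=1, seed=0):
    """Sample each group at `rate`, keeping at least `min_per_group` rows per group.

    Returns (row, weight) pairs, where weight = group size / rows kept,
    so that summing weights recovers each group's original row count.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    sample = []
    for members in groups.values():
        # Proportional allocation with a floor, capped at the group size.
        k = min(len(members), max(min_per_group, round(len(members) * rate)))
        weight = len(members) / k
        sample.extend((row, weight) for row in rng.sample(members, k))
    return sample


# Toy data (assumed): one large group and one rare group.
rows = [("common", i) for i in range(9990)] + [("rare", i) for i in range(10)]

sample = stratified_sample(rows, key=lambda r: r[0], rate=0.01, min_per_group=5)

# Weighted COUNT per group, estimated from the sample alone.
estimated_counts = defaultdict(float)
for (group, _), weight in sample:
    estimated_counts[group] += weight
```

With a 1% uniform random sample of the same rows, the ten "rare" rows would often be absent altogether; the floor guarantees they appear, and the weights keep group-level aggregates consistent with the full data.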