不确定时态数据Top-k查询

Efficient Top-k Query Processing on Uncertain Temporal Data

作　　者：韦建华许建秋[1] WEI Jian-hua;XU Jian-qiu(Department of Computer Science and Technology,Nanjing University of Aeronautics and Astronautics,Nanjing 210016,China)

机构地区：[1]南京航空航天大学计算机科学与技术学院,南京210016

出　　处：《计算机科学》2020年第9期67-73,共7页Computer Science

基　　金：国家自然科学基金(61972198)。

摘　　要：时态数据在医疗、经济和电子商务等领域有着广泛的应用。由于时间的测量技术不精确等因素,时态数据具有不确定性。文中针对该数据进行研究,处理Top-k查询,即返回与查询点相交的k个权值最大的数据,该权值是根据数据权值和相交概率按一定规则组合计算所得。为有效解决该查询问题,提出了一个基于关系模型和辅助结构的2D R-tree结构,其中关系模型用于管理所有区间数据的R-tree,辅助结构用于管理R-tree中每个节点内部数据权值的大小关系。基于该结构,提出了按权值的降序访问数据的查询算法。从根节点开始遍历R-tree,对于与查询点相交的节点,根据辅助结构中存储的信息找到数据权值最大的项,将它确定为下一个访问对象。实验使用数据规模在30万到1000万的合成数据集,以及包括大约320万条的航班信息的真实数据集。在可扩展数据库SECONDO系统下,将所提方法与无索引方法、R-tree和区间树方法在性能上进行比较,并以平均I/O访问次数和CPU时间作为性能的评判指标。实验结果表明,在1000万条的数据规模下,所提方法优于对比方法2~3个数量级。通过将实验返回的k个结果的概率与权值和实际相交数据的概率和权值作比较可以发现,实验返回的k个结果的概率与权值均靠近实际相交数据的概率和权值的最大值,因此所提算法可行且有效。Temporal data is widely used in many applications such as medical,economic and e-commerce.The uncertainty is mainly caused by factors such as inaccurate measurement techniques.This paper studies top-k queries over uncertain temporal data.Such a query returns Top-k intervals with the largest scores which are calculated by a function combining the original weight of the data and the probability of intersection with the query data.To answer the query efficiently,this paper proposes a 2D R-tree based on the relational model and auxiliary structures.The relational model is used to manage all intervals,and the auxiliary structure is used to manage the order of the weights of each node in the R-tree.Based on the proposed index structure,a query algorithm for accessing data in descending order by weights is proposed.It traverses the R-tree from the root node.For each node that intersects with the query point,the item with the largest weight in it can be found according to the information stored in the auxiliary structure,and it is determined as the next accessed object.This paper uses synthetic datasets with data sizes ranging from 300000 to 10 million,and a real dataset of a flight information with size of 3.2 million.In the extensible database system SECONDO,the proposed method is compared with the unindexed method,R-tree,and interval tree,and the average I/O access times and CPU time are used as the indicators of the experimental results.The experimental results show that the proposed approach outperforms baseline methods by 2 to 3 orders of magnitude using 10 million intervals.Comparing the probabilities and weights of the k results with the results of all intersecting data,it is found that the probabilities and weights of the k results are close to the maximum value of the actual intersecting data,so the proposed algorithm is feasible and effective.

关键词：时态数据不确定性 TOP-K

分类号：TP391[自动化与计算机技术—计算机应用技术]

参考文献：

正在载入数据...

二级参考文献：

正在载入数据...

耦合文献：

正在载入数据...

引证文献：

正在载入数据...

二级引证文献：

正在载入数据...

同被引文献：

正在载入数据...

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

不确定时态数据Top-k查询

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

高级检索检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

不确定时态数据Top-k查询

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

用户登录

高级检索检索式检索