一种高效的面向高并发图分析任务的存储系统  被引量:3

An efficient storage system towards high throughput of concurrent graph processing jobs

在线阅读下载全文

作  者:赵进 姜新宇 张宇[1,2,3,4] 廖小飞 金海[1,2,3,4] 刘海坤[1,2,3,4] 杨赟[1,2,3,4] 张吉[5] 王彪[6] 余婷 Jin ZHAO;Xinyu JIANG;Yu ZHANG;Xiaofei LIAO;Hai JIN;Haikun LIU;Yun YANG;Ji ZHANG;Biao WANG;Ting YU(National Engineering Research Center for Big Data Technology and System,Huazhong University of Science and Technology,Wuhan 430074,China;Service Computing Technology and System Lab,Huazhong University of Science and Technology,Wuhan 430074,China;Cluster and Grid Computing Lab,Huazhong University of Science and Technology,Wuhan 430074,China;School of Computer Science and Technology,Huazhong University of Science and Technology,Wuhan 430074,China;School of Sciences,University of Southern Queensland,Toowoomba 4350,Australia;Zhejiang Lab,Hangzhou 311121,China)

机构地区:[1]华中科技大学大数据技术与系统国家地方联合工程研究中心,武汉430074 [2]华中科技大学服务计算技术与系统教育部重点实验室,武汉430074 [3]华中科技大学集群与网格计算湖北省重点实验室,武汉430074 [4]华中科技大学计算机科学与技术学院,武汉430074 [5]School of Sciences University of Southern Queensland,Toowoomba 4350,Australia [6]之江实验室,杭州311121

出  处:《中国科学:信息科学》2022年第1期111-128,共18页Scientia Sinica(Informationis)

基  金:国家重点研发计划(批准号:2018YFB1003500);国家自然科学基金(批准号:61832006,61825202,62072193);之江实验室开放课题(批准号:2021KD0AB01)、之江实验室PI研究项目(批准号:111007-PI2001);浙江省自然科学基金(批准号:LZ21F030001)资助项目。

摘  要:随着现实世界中图计算需求的快速增长,同一平台上往往并发运行着大量迭代图分析任务.然而,现有的图计算系统主要是为了高效执行单个图分析任务而设计的.因此,当多个并发图分析任务同时在同一个底层图上并行执行时,现有图计算系统会面临巨大的数据访问开销.为了提高并发图分析任务的吞吐量,现有的核外并发图处理方案通过共享图数据减少并发任务的数据存储与访问开销.但是,由于现实世界中图的图顶点度数幂律分布特性以及图分析任务之间的差异性,现有方案在访问数据时依旧存在着大量的不必要的冗余I/O开销.这是因为即使静态图分区中绝大部分顶点处于非活跃状态或者只被少数图分析任务共享,现有方法也依旧会将整个分区加载入内存供并发图分析任务处理.为解决上述问题,本文提出了一个面向并发图分析任务的高效存储系统GraphDP.它能够插入到现有核外图计算系统中来透明有效地减少现有图计算系统执行并发图分析任务时的存储消耗与数据访问开销,从而提高并发图分析任务的吞吐量.具体来说,GraphDP使用一种新颖的动态I/O调度策略,能够使系统以最优的I/O访问方式完成图数据的加载,并有效地减少加载到内存和cache的数据.同时,GraphDP通过高效的缓存机制在内存中优先缓存被频繁访问的图数据,从而进一步减少数据访问开销.为证明GraphDP的有效性,我们将GraphDP插入到目前流行的核外图计算系统中,包括GridGraph,GraphChi和X-Stream.实验结果表明,GraphDP分别将GridGraph,GraphChi和X-Stream的吞吐量提高了1.57~2.19倍,1.86~2.37倍和1.62~2.21倍.With the rapidly growing demand of graph analytics,a large number of iterative graph processing jobs are often executed concurrently on the same platform.However,the existing graph processing system is mainly designed to efficiently perform a single graph processing job.Therefore when concurrent jobs are executed on the same underlying graph the system will suffer from high data access cost.To improve the throughput of the concurrent jobs,existing out-of-core concurrent graph processing schemes reduce the data storage and access cost by sharing graph data among these jobs.Due to the power-law property of real-world graphs and the differentiation between the concurrent jobs,existing schemes still suffer from a large amount of unnecessary I/O traffic,because even if the most proportion of the vertices in the static graph partition is inactive or only shared by a few jobs,this partition will still be entirely loaded into the memory for the processing of the concurrent jobs.To solve these problems,we propose an efficient storage system,called GraphDP,to achieve high throughput of concurrent graph processing jobs.It can be integrated into existing out-of-core graph processing systems to reduce the storage and data access overhead.Specifically,GraphDP uses a novel dynamic I/O scheduling strategy,which enables the graph data to be loaded in an optimal I/O mode and effectively reduces the redundant data loaded into the memory and cache.Meanwhile,GraphDP preferentially stores frequently accessed graph data in the memory via an efficient caching mechanism,which further reduces the data access overhead.To demonstrate the effectiveness of GraphDP,we integrate it into three state-of-the-art out-of-core graph processing systems,including GridGraph,GraphChi and X-Stream.Experiment results show that GraphDP improves the throughput of GridGraph,GraphChi and X-Stream by 1.57–2.19 times,1.86–2.37 times and 1.62–2.21 times,respectively.

关 键 词:迭代图处理 并发任务 存储系统 I/O开销 吞吐量 

分 类 号:TP333[自动化与计算机技术—计算机系统结构]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象