Mining of Data Redundancy Characteristic in Deduplication Systems  (Cited by: 1)

Authors: Jiang Zhixiong (江志雄), Lu Chunyang (陆春阳), Yu Hongliang (余宏亮) [2]

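Affiliations: [1] China Petroleum (中国石油) Changping Data Center, Beijing 102206; [2] Institute of High Performance Computing, Tsinghua University, Beijing 100084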

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2014, No. 10, pp. 2237-2242 (6 pages)

Funding: Supported by the National High Technology Research and Development Program of China ("863" Program), grant 2012AA012600

Abstract: Data deduplication is widely adopted in storage systems as an effective technique for reducing the volume of stored data. However, existing studies of data redundancy characteristics in deduplication (dedup) systems are limited: most focus only on raising the dedup ratio for specific datasets. This paper mines the file-level data redundancy characteristics of dedup systems in greater depth. First, we define the correlation of files and file sets based on the duplicate data blocks they share, and reduce the correlation mining problem to the well-studied frequent itemset mining problem. Second, we give an offline procedure that transforms dedup metadata into a transaction group database, so that frequent itemset mining algorithms can be applied to compute correlations. Finally, we propose an incremental correlation mining algorithm that can be embedded in the dedup storage system to analyze data redundancy characteristics in near real time. This work helps explain the sources and distribution of redundant data in dedup systems, and can thereby guide the design of more effective dedup algorithms and I/O optimization algorithms for specific application environments.
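
The pipeline summarized in the abstract can be illustrated with a small sketch. The abstract does not give the paper's exact definitions, so everything below is an assumption made for illustration: a "transaction" is taken to be the set of files that contain one shared block, the "correlation" of a file set is approximated by its support (the number of blocks all its files share), and the names `build_transactions`, `frequent_file_sets`, and `IncrementalPairCounter` are hypothetical. The support counting is brute force, not the authors' algorithm or an optimized miner such as Apriori or FP-growth.

```python
from collections import defaultdict
from itertools import combinations


def build_transactions(file_blocks):
    """Turn file -> block fingerprints into one transaction per shared block.

    Assumption: a transaction is the set of files containing a given block.
    Blocks owned by a single file carry no redundancy information and are dropped.
    """
    owners = defaultdict(set)
    for name, fingerprints in file_blocks.items():
        for fp in fingerprints:
            owners[fp].add(name)
    return [frozenset(files) for _, files in sorted(owners.items()) if len(files) >= 2]


def frequent_file_sets(transactions, min_support, max_size=3):
    """Brute-force support counting for file sets of size 2..max_size."""
    counts = defaultdict(int)
    for files in transactions:
        ordered = sorted(files)
        for size in range(2, max_size + 1):
            for combo in combinations(ordered, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}


class IncrementalPairCounter:
    """Keeps pairwise shared-block counts up to date as files arrive.

    Assumption: each file is added exactly once; only pairs are tracked.
    """

    def __init__(self):
        self.owners = defaultdict(set)      # fingerprint -> files seen so far
        self.pair_counts = defaultdict(int)  # (file_a, file_b) -> shared blocks

    def add_file(self, name, fingerprints):
        for fp in set(fingerprints):  # count each distinct block once per file
            for other in self.owners[fp]:
                if other != name:
                    self.pair_counts[tuple(sorted((name, other)))] += 1
            self.owners[fp].add(name)


# Made-up example metadata: file name -> fingerprints of its blocks.
file_blocks = {
    "a.img": ["h1", "h2", "h3", "h4"],
    "b.img": ["h1", "h2", "h3", "h9"],
    "c.img": ["h3", "h7"],
}

transactions = build_transactions(file_blocks)
print(frequent_file_sets(transactions, min_support=2))

counter = IncrementalPairCounter()
for name, fps in file_blocks.items():
    counter.add_file(name, fps)
print(dict(counter.pair_counts))
```

The batch functions correspond loosely to the offline path (metadata to transactions to frequent itemsets), while the counter shows why an incremental variant is attractive: each newly written file only touches the blocks it contains, so the counts can be refreshed without rescanning the whole store.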

Keywords: data deduplication; storage systems; data redundancy characteristics; frequent itemset mining

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
