Source: Journal of Zhejiang University-Science C (Computers and Electronics), 2010, Issue 5, pp. 315-327 (13 pages)
Funding: supported by the National Basic Research Program (973) of China (No. 2004CB318201); the National High-Tech Research and Development Program (863) of China (No. 2008AA01A402); the National Natural Science Foundation of China (Nos. 60703046 and 60873028)
Abstract: Apart from high space efficiency, other demanding requirements for enterprise de-duplication backup are high performance, high scalability, and availability in large-scale distributed environments. The main challenge is reducing the significant disk input/output (I/O) overhead incurred by constantly accessing the disk to identify duplicate chunks. Existing inline de-duplication approaches mainly rely on duplicate locality to avoid the disk bottleneck, and thus suffer degraded performance under workloads with poor duplicate locality. This paper presents Chunkfarm, a post-processing de-duplication backup system designed to improve capacity, throughput, and scalability for de-duplication. Chunkfarm performs de-duplication backup using the hash join algorithm, which turns the notoriously random and small disk I/Os of fingerprint lookups and updates into large sequential disk I/Os, achieving high write throughput unaffected by workload locality. More importantly, by decentralizing fingerprint lookup and update, Chunkfarm allows a cluster of servers to perform de-duplication backup in parallel; it is therefore well suited to distributed implementation and applicable to large-scale distributed storage systems.
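The abstract's core idea, replacing per-chunk random probes of an on-disk fingerprint index with a partitioned hash join over batched fingerprints, can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the partition count, and the in-memory sets standing in for sequential reads of on-disk partitions are all illustrative assumptions.

```python
# Minimal sketch (assumed, not Chunkfarm's actual code) of hash-join
# de-duplication: both the batch of new chunk fingerprints and the stored
# fingerprint index are partitioned by a hash prefix, and each pair of
# partitions is joined in memory. Each stored partition is then read once,
# sequentially, instead of being probed at random once per chunk.

from collections import defaultdict

NUM_PARTITIONS = 16  # assumed; sized so one partition fits in memory

def partition(fingerprints):
    """Group fingerprints into buckets by a prefix of the hash value."""
    buckets = defaultdict(set)
    for fp in fingerprints:
        buckets[fp[0] % NUM_PARTITIONS].add(fp)
    return buckets

def hash_join_dedup(new_fps, stored_fps):
    """Return (duplicates, unique) among new_fps relative to stored_fps."""
    new_parts = partition(new_fps)
    stored_parts = partition(stored_fps)  # stands in for sequential scans of an on-disk index
    duplicates, unique = set(), set()
    for pid, batch in new_parts.items():
        index = stored_parts.get(pid, set())  # build side: one partition in memory
        duplicates |= batch & index           # probe side: in-memory membership tests
        unique |= batch - index
    return duplicates, unique

if __name__ == "__main__":
    stored = {bytes([i]) * 20 for i in range(8)}        # pretend SHA-1 fingerprints
    incoming = {bytes([i]) * 20 for i in range(4, 12)}  # half already stored, half new
    dups, uniq = hash_join_dedup(incoming, stored)
    print(len(dups), "duplicates,", len(uniq), "new chunks")
```

Because the join is keyed only on fingerprint partitions, disjoint partitions can also be assigned to different servers, which is the property the abstract credits for Chunkfarm's parallel, decentralized lookup.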
Keywords: backup system; de-duplication; post-processing; fingerprint lookup; scalability
Classification: TP309.3 [Automation and Computer Technology: Computer System Architecture]