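Authors: Yan Fang (阎芳)[1,2], Li Yuanzhang (李元章)[1], Zhang Quanxin (张全新)[1], Tan Yu'an (谭毓安)[1]
Affiliations: [1] School of Computer Science, Beijing Institute of Technology, Beijing 100086; [2] School of Information, Beijing Wuzi University, Beijing 101149
Source: Journal of Computer Research and Development (《计算机研究与发展》), 2015, No. 7, pp. 1546-1557 (12 pages)
Funding: National High Technology Research and Development Program of China ("863" Program) (2013AA01A212); National Natural Science Foundation of China (61370063); Beijing Higher Education Young Elite Teacher Project (YETP1532; YETP1178)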
Abstract: Most existing data de-duplication techniques are based on content defined chunking (CDC), a prevalent algorithm for removing redundant data segments in storage systems, and do not consider the content characteristics of different file types. CDC determines chunk boundaries in a random way and applies a single strategy to all file types. It has been shown to suit text and simple content, but it does not achieve optimal performance on compound files. A compound file is composed of unstructured data, usually occupies a large amount of storage space, and contains multimedia data. Object-based data de-duplication is currently the most advanced method and an effective solution for detecting duplicate data in such files. We analyze the properties of compound files under the OpenXML standard, give a basic method for object extraction, and propose an algorithm that determines de-duplication granularity from object distribution and object structure. The goal is to detect identical objects both across different files and at different positions within the same file, so that de-duplication remains effective when the physical layout of a compound file changes. Simulation experiments on typical unstructured data sets show that, overall, object-based de-duplication improves the de-duplication ratio on unstructured data by about 10% compared with the CDC method.
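Keywords: content defined chunking; object; unstructured data; OpenXML standard; compound file; data de-duplication
Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]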
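The object-level idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it relies only on the fact that OpenXML documents are ZIP packages, treats every package part as one object, and stores each distinct object once by its SHA-256 digest. The granularity-determining step from the paper is omitted, and the function names `extract_objects`, `deduplicate`, and `make_package` are hypothetical helpers introduced for illustration.

```python
import hashlib
import io
import zipfile


def extract_objects(package_bytes):
    """Yield (part_name, content) for every part of an OpenXML (ZIP) package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        for info in package.infolist():
            if not info.is_dir():
                # read() returns the decompressed bytes of the part
                yield info.filename, package.read(info)


def deduplicate(packages):
    """Store each distinct object once, keyed by its SHA-256 digest.

    Assumption: one package part equals one object; the paper's
    granularity selection is not modeled here.
    Returns (store, logical_bytes, stored_bytes).
    """
    store = {}
    logical_bytes = 0
    for package_bytes in packages:
        for _name, content in extract_objects(package_bytes):
            logical_bytes += len(content)
            # The part name is ignored, so the same object is detected
            # even if it sits at a different path or in a different file.
            store.setdefault(hashlib.sha256(content).hexdigest(), content)
    stored_bytes = sum(len(content) for content in store.values())
    return store, logical_bytes, stored_bytes


def make_package(parts):
    """Build an in-memory ZIP package from {part_name: bytes} (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name, content in parts.items():
            package.writestr(name, content)
    return buffer.getvalue()
```

Because objects are hashed after extraction and decompression, an image embedded in two different documents is stored once regardless of where it sits in each package, which is the layout-independence property the abstract refers to.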