An algorithm for matching original experimental records based on improved CDC

Authors: CAI Yina; CHEN Xin; QIN Zhiwu; WANG Xin; BAO Xianyu; PENG Jinxue; LIN Yongqi; LI Junlin (Shenzhen Academy of Inspection and Quarantine, Shenzhen 518045, Guangdong Province, P.R. China; Food Inspection and Quarantine Center, Shenzhen Customs, Shenzhen 518045, Guangdong Province, P.R. China; Information Center, Shenzhen Customs, Shenzhen 518045, Guangdong Province, P.R. China)

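Affiliations: [1] Shenzhen Academy of Inspection and Quarantine, Shenzhen 518045, Guangdong, China [2] Food Inspection and Quarantine Center, Shenzhen Customs, Shenzhen 518045, Guangdong, China [3] Information Center, Shenzhen Customs, Shenzhen 518045, Guangdong, China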

Source: Journal of Shenzhen University (Science and Engineering), 2022, No. 5, pp. 509-514 (6 pages)

Funding: National Key Research and Development Program of China (2019YFC1605504, 2018YFC1603601).

Abstract: Generating laboratory test reports is currently slow and prone to occasional errors. To address this, we propose a general technique, based on a fence factor, for automatically capturing original experimental record files. Files already read on the current day are first filtered out exactly by computing a hash of the whole file. An improved content-defined chunking (CDC) algorithm then splits the text into chunks. The improvement lies mainly in two settings: the sliding window advances in units equal to the height of one line plus the line spacing, and the number of bytes in the sliding window is restricted to a set range. Once chunking is complete, a string matching algorithm based on a data block index performs the matching: it uses the data block index table to build a mapping between pattern strings and data blocks, so that a pattern string Pn is matched quickly to its data block through the table. In tests on original experimental record files from a customs laboratory, the algorithm used less memory and achieved higher chunking throughput.
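The three stages named in the abstract (whole-file hash filtering, line-stepped chunking with a bounded chunk size, and lookup through a data block index table) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the fence factor, the window settings, or the hash functions, so SHA-256, the MD5-based boundary test, and every name and parameter here (file_digest, filter_unread, is_boundary, chunk_by_lines, build_index, min_size, max_size, mask) are assumptions chosen only for illustration.

```python
import hashlib
from collections import defaultdict


def file_digest(data: bytes) -> str:
    """Hash of the whole file (SHA-256 is an assumption, not from the paper)."""
    return hashlib.sha256(data).hexdigest()


def filter_unread(files: dict, seen: set) -> list:
    """Return names of files whose whole-file hash has not been seen yet.

    `seen` stands in for the set of hashes of files already read today;
    it is updated in place.
    """
    unread = []
    for name, data in files.items():
        digest = file_digest(data)
        if digest not in seen:
            seen.add(digest)
            unread.append(name)
    return unread


def is_boundary(line: bytes, mask: int = 0x03) -> bool:
    """Hypothetical content-defined cut test: hash the line, check low bits."""
    h = int.from_bytes(hashlib.md5(line).digest()[:4], "big")
    return (h & mask) == 0


def chunk_by_lines(data: bytes, min_size: int = 64, max_size: int = 512) -> list:
    """Chunk text one line at a time, keeping chunk sizes within a range.

    The window advances by whole lines (a stand-in for the paper's
    'line height plus line spacing' step). No cut is made before
    min_size bytes; a cut is forced at the first line boundary at or
    after max_size bytes, so max_size is a soft cap for very long lines.
    """
    chunks, current = [], b""
    for line in data.splitlines(keepends=True):
        current += line
        if len(current) < min_size:
            continue
        if len(current) >= max_size or is_boundary(line):
            chunks.append(current)
            current = b""
    if current:  # trailing remainder, possibly shorter than min_size
        chunks.append(current)
    return chunks


def build_index(chunks: list, patterns: list) -> dict:
    """Map each pattern string to the indices of the chunks containing it."""
    index = defaultdict(list)
    for i, chunk in enumerate(chunks):
        for pattern in patterns:
            if pattern.encode("utf-8") in chunk:
                index[pattern].append(i)
    return dict(index)
```

In this sketch, a caller would pass the day's files through filter_unread, chunk each unread file with chunk_by_lines, and call build_index once with the pattern strings of interest; matching a pattern Pn afterwards is a dictionary lookup that returns the positions of the relevant chunks instead of a rescan of the whole file.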

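Keywords: computer application; data block; pattern string; string matching; original experimental records; content-defined chunking algorithm; laboratory test report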

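Classification: TP301.6 [Automation and Computer Technology: Computer System Architecture]; TP391.1 [Automation and Computer Technology: Computer Science and Technology]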

 
