A Method for Generating Data Format Rules Based on Regular Inference


Authors: LI Xu [1]; TIAN Yuan; DENG Hongmei; ZHAO Shuying; GAO Juntao [2]

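Affiliations: [1] Institute of Survey, Design and Informatisation, Jidong Oilfield Branch of PetroChina Company Limited, Tangshan, Hebei 063004, China; [2] School of Computer & Information Technology, Northeast Petroleum University, Daqing, Heilongjiang 163318, China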

Source: Journal of Northeast Petroleum University, 2023, No. 6, pp. 112-124, I0008 (14 pages in total)

Funding: Northeast Petroleum University Special Project for Teams in Characteristic Fields (2022TSTD-03).

Abstract: Formulating data quality rules by hand is time-consuming, laborious, and error-prone. To address this, a method based on regular inference theory is studied for automatically generating data format rules from positive samples. A sample generalization strategy combining multi-scale sample enhancement, loop patterns (generalization of repetition counts), and common subsequence extraction is proposed to construct the candidate space of format rules, expressed as regular expressions. The rationality of multi-scale sample enhancement is proven, and the effect of common subsequences on the quality (encoding cost) of format rules is analyzed. An objective function is constructed from encoding cost, the combinatorial optimization over candidate rules is modeled as an integer program, and the better rules are recommended to data governance staff. Experiments on real and simulated datasets show that the rules generated by the method are on average 70% higher in quality than those of comparable methods, which supports the feasibility and effectiveness of the algorithm. The method can improve the efficiency of formulating and managing data format rules.
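To make the idea of inferring a format rule from positive samples concrete, the following is a minimal sketch in Python. It is not the authors' algorithm: it illustrates only the generalization of repetition counts, and it does not implement multi-scale sample enhancement, common subsequence extraction, the encoding-cost objective, or the integer-programming selection step. The character classes, the function names (`char_class`, `tokenize`, `infer_rules`), and the sample strings are all hypothetical choices made for illustration.

```python
import re
from itertools import groupby


def char_class(ch: str) -> str:
    """Map a character to a coarse regex class (an assumed, illustrative choice)."""
    if ch.isascii() and ch.isdigit():
        return r"\d"
    if ch.isascii() and ch.isupper():
        return "[A-Z]"
    if ch.isascii() and ch.islower():
        return "[a-z]"
    return re.escape(ch)  # any other character is kept as an escaped literal


def tokenize(sample: str) -> list[tuple[str, int]]:
    """Collapse a sample into consecutive runs of (character class, run length)."""
    return [(cls, len(list(run))) for cls, run in groupby(sample, key=char_class)]


def infer_rules(samples: list[str]) -> list[str]:
    """Return one regex per distinct class sequence, with run lengths
    generalized to the observed {min,max} range at each position."""
    groups: dict[tuple[str, ...], list[list[int]]] = {}
    for sample in samples:
        runs = tokenize(sample)
        signature = tuple(cls for cls, _ in runs)
        groups.setdefault(signature, []).append([n for _, n in runs])

    rules = []
    for signature, counts in groups.items():
        parts = []
        # zip(*counts) yields, for each position, the run lengths seen there
        for cls, lengths in zip(signature, zip(*counts)):
            lo, hi = min(lengths), max(lengths)
            if lo == hi == 1:
                quantifier = ""
            elif lo == hi:
                quantifier = f"{{{lo}}}"
            else:
                quantifier = f"{{{lo},{hi}}}"
            parts.append(cls + quantifier)
        rules.append("".join(parts))
    return rules


# Hypothetical positive samples of an identifier-like column
samples = ["AB-1234", "CD-567", "xy-89"]
rules = infer_rules(samples)
for rule in rules:
    print(rule)
```

Each inferred rule accepts every sample that shares its class sequence and rejects values whose run lengths fall outside the observed range. A fuller approach in the spirit of the abstract would generate several competing candidates per group and choose among them by cost, which this sketch leaves out.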

Keywords: data quality rules; data format rules; regular expression; regular inference

Classification No.: TP391.7 [Automation and Computer Technology: Computer Application Technology]

 
