Authors: LI Xu [1], TIAN Yuan, DENG Hongmei, ZHAO Shuying, GAO Juntao [2]
Affiliations: [1] Institute of Survey, Design and Informatisation, Jidong Oilfield Branch of PetroChina Company Limited, Tangshan, Hebei 063004, China; [2] School of Computer & Information Technology, Northeast Petroleum University, Daqing, Heilongjiang 163318, China
Source: Journal of Northeast Petroleum University, 2023, No. 6, pp. 112-124, I0008 (14 pages)
Funding: Northeast Petroleum University Special Project for Teams in Characteristic Fields (2022TSTD-03).
Abstract: To address the time-consuming and error-prone nature of manually formulating data quality rules, a method to automatically generate data format rules from positive samples is studied based on regular inference theory. A sample generalization strategy based on multi-scale sample enhancement, cycle number generalization, and common subsequence extraction is proposed to construct the candidate space of format rules (regular expressions). The rationality of multi-scale sample enhancement is proven, and the impact of common subsequences on the quality (encoding cost) of the format rules is analyzed. An objective function is constructed based on encoding cost, and the combinatorial optimization problem over candidate rules is modeled by integer programming, so that concise, better-quality rules are recommended to data governance staff. Experimental results on both real and simulated datasets show that the proposed method improves rule quality by an average of 70% compared with similar methods, validating the feasibility and effectiveness of the algorithm. The method can improve the efficiency of formulating and managing data format rules.
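Keywords: data quality rules; data format rules; regular expressions; regular inference
Classification: TP391.7 [Automation and Computer Technology: Computer Application Technology]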
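The idea of inferring a format rule from positive samples, as described in the abstract, can be illustrated with a deliberately simplified sketch. It is not the authors' algorithm: it performs no multi-scale sample enhancement, no common subsequence extraction, and no integer programming over encoding costs. It only maps characters to coarse classes, run-length encodes them, and generalizes the repetition counts across samples. The function names (`char_class`, `tokenize`, `infer_format_rule`) and the choice of character classes are assumptions made for this illustration.

```python
import re
from itertools import groupby


def char_class(ch):
    """Map one character to a coarse regex class (an assumed, simplistic choice)."""
    if "0" <= ch <= "9":
        return r"\d"
    if "a" <= ch <= "z":
        return "[a-z]"
    if "A" <= ch <= "Z":
        return "[A-Z]"
    # Anything else is kept as an escaped literal.
    return re.escape(ch)


def tokenize(sample):
    """Run-length encode a sample into (class, count) pairs."""
    classes = [char_class(c) for c in sample]
    return [(cls, len(list(group))) for cls, group in groupby(classes)]


def infer_format_rule(samples):
    """Return a regex covering all samples, or None if their shapes differ.

    Only the repetition counts are generalized; samples whose sequence of
    character classes differs are out of scope for this sketch.
    """
    token_lists = [tokenize(s) for s in samples]
    shapes = {tuple(cls for cls, _ in toks) for toks in token_lists}
    if len(shapes) != 1:
        return None
    shape = shapes.pop()
    parts = []
    for i, cls in enumerate(shape):
        counts = [toks[i][1] for toks in token_lists]
        lo, hi = min(counts), max(counts)
        if lo == hi == 1:
            quant = ""
        elif lo == hi:
            quant = "{%d}" % lo
        else:
            quant = "{%d,%d}" % (lo, hi)
        parts.append(cls + quant)
    return "".join(parts)


if __name__ == "__main__":
    rule = infer_format_rule(["AB-123", "CD-4567"])
    print(rule)
    print(bool(re.fullmatch(rule, "XY-9999")))
```

A rule inferred this way accepts unseen values with the same shape and rejects values whose run lengths fall outside the observed range. Choosing among several competing candidate rules, which the abstract handles with an encoding-cost objective, is left out of this sketch.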