A Tuple-level Data Lineage Approach Based on Word Embedding (基于词嵌入的元组级数据溯源方法)  Cited by: 3

Authors: YANG Bin; GAO Jun-tao [1]; WANG Zhi-bao [1]; LI Fei; MA Qiang; JIANG Shu-tao (School of Computer and Information Technology, Northeast Petroleum University, Daqing 163318, China; School of Information and Electrical Engineering, Heilongjiang Bayi Agricultural University, Daqing 163319, China)

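Affiliations: [1] School of Computer and Information Technology, Northeast Petroleum University, Daqing 163318, Heilongjiang, China; [2] School of Information and Electrical Engineering, Heilongjiang Bayi Agricultural University, Daqing 163319, Heilongjiang, China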

Source: Computer Technology and Development (《计算机技术与发展》), 2023, No. 12, pp. 49-57 (9 pages)

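Funding: National Natural Science Foundation of China (61902222); Northeast Petroleum University Cultivation Fund for Outstanding Young and Middle-aged Research Innovation Teams (KYCXTDQ202101).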

Abstract: In the era of information explosion, data volumes grow daily, and data mining can uncover the relationships within that data, but only if the data is correct; otherwise all subsequent work is meaningless. Data lineage technology helps data analysts quickly locate the source and processing history of erroneous data, reduces the time and difficulty of analyzing such data, and is valuable for data quality control and trustworthy data management. Existing tuple-level data lineage methods suffer from high storage overhead and low lineage efficiency, so this paper uses word embedding to improve them. First, a tuple vectorization encoding mechanism is studied, and lineage relationships between tuples are identified from the similarity of their tuple vectors. Second, an optimization algorithm based on attribute importance is proposed to improve lineage precision. Third, approximate nearest neighbor search and a tuple filtering optimization mechanism are introduced to reduce the time complexity of lineage tracing. Finally, a directed acyclic graph is used to display the lineage relationships of tuple data. Experimental results show that the method achieves relatively high precision, relatively low time complexity, and lower storage consumption, and can effectively improve tuple-level data lineage.
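
The pipeline outlined in the abstract (tuple vectorization, attribute weighting, similarity matching, lineage edges) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It rests on assumptions that are not in the source: the hash-based `token_vector` is a stand-in for a trained word embedding, the brute-force scan stands in for approximate nearest neighbor search, and the names `tuple_vector`, `trace_lineage`, the `weights` mapping, and the `threshold` value are hypothetical.

```python
import hashlib
import math

DIM = 32  # vector dimension; arbitrary choice for this sketch


def token_vector(token):
    """Deterministic pseudo-embedding of a token.

    Assumption: a stand-in for a trained word embedding model.
    """
    digest = hashlib.sha256(token.lower().encode("utf-8")).digest()
    return [b / 127.5 - 1.0 for b in digest[:DIM]]


def tuple_vector(row, weights):
    """Encode a tuple (dict of attribute -> value) as a weighted sum of token vectors.

    `weights` maps attribute names to a hypothetical importance score;
    attributes without an entry default to 1.0.
    """
    vec = [0.0] * DIM
    for attr, value in row.items():
        w = weights.get(attr, 1.0)
        for tok in str(value).split():
            tv = token_vector(tok)
            for i in range(DIM):
                vec[i] += w * tv[i]
    return vec


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def trace_lineage(derived, sources, weights, threshold=0.5):
    """Return lineage edges (source_id, derived_id, similarity).

    Each derived tuple is linked to its most similar source tuple when the
    similarity reaches `threshold`. The exhaustive scan here is a
    brute-force substitute for approximate nearest neighbor search.
    """
    src_vecs = [(sid, tuple_vector(row, weights)) for sid, row in sources.items()]
    if not src_vecs:
        return []
    edges = []
    for did, row in derived.items():
        dv = tuple_vector(row, weights)
        best_id, best_sim = max(
            ((sid, cosine(dv, sv)) for sid, sv in src_vecs),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            edges.append((best_id, did, best_sim))
    return edges


if __name__ == "__main__":
    sources = {
        "s1": {"name": "Alice Wang", "city": "Daqing"},
        "s2": {"name": "Bob Li", "city": "Harbin"},
    }
    derived = {"d1": {"name": "Alice Wang", "city": "Daqing", "dept": "R&D"}}
    for edge in trace_lineage(derived, sources, weights={"dept": 0.2}):
        print(edge)
```

The returned edges always point from a source tuple to a derived tuple, so they can be loaded into a graph structure to display lineage as a directed acyclic graph. Lowering the weight of an attribute reduces its influence on the match, which is the intuition behind attribute importance in this sketch.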

Keywords: structured data; data lineage; tuple vector; similarity comparison; word embedding

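Classification No.: TP311.13 [Automation and Computer Technology: Computer Software and Theory]; TP391 [Automation and Computer Technology: Computer Science and Technology]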

 
