An Instance-Based Data Transformation Method


Authors: Bo Fengyu (薄凤羽), Li Gui (李贵)[1], Li Zhengyu (李征宇)[1], Han Ziyang (韩子扬)[1], Cao Keyan (曹科研)

Affiliation: [1] School of Information and Control Engineering, Shenyang Jianzhu University, Shenyang, Liaoning

Source: Hans Journal of Data Mining (《数据挖掘》), 2022, No. 3, pp. 235-245 (11 pages)

Abstract: The Web contains a large amount of useful information, but because that information is semi-structured, non-expert users cannot make good use of it for data transformation and integration. This paper therefore proposes an instance-based data transformation method in which users only need to provide suitable input-output examples to obtain the required transformation. First, a pattern distance measure based on sequence alignment is used to generate representative examples from the user-provided examples. Second, a code analysis method based on information entropy is proposed and combined with the representative examples to screen candidate functions relevant to the transformation task. Finally, the relevant functions are ranked and applied as row and column transformations, and a data transformation program consistent with all of the examples is then synthesized. An experimental evaluation on a real estate data set shows that the method handles many common transformations that existing systems do not support, achieves nearly 80% of the data transformations in the experimental system, and is considerably more accurate than other systems of the same type.

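Keywords: data transformation; code analysis; semi-structured; distance metric; information entropy; sequence alignment; relevant functions; Web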

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]
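
The first step in the abstract, choosing representative examples by a pattern distance based on sequence alignment, can be illustrated with a minimal sketch. The abstract does not give the paper's actual pattern language, alignment costs, or selection rule, so everything here is an assumption made for illustration: the character-class tokens (`D`, `A`, `S`), the unit-cost edit distance standing in for the alignment score, the greedy selection, and the hypothetical names `to_pattern`, `pattern_distance`, and `representatives`.

```python
def to_pattern(value: str) -> list[str]:
    """Collapse a string into a sequence of character-class tokens.

    Assumed token set: D = digits, A = letters, S = whitespace;
    any other character (punctuation) is kept literally.
    Consecutive identical tokens are merged into one.
    """
    pattern: list[str] = []
    for ch in value:
        if ch.isdigit():
            token = "D"
        elif ch.isalpha():
            token = "A"
        elif ch.isspace():
            token = "S"
        else:
            token = ch
        if not pattern or pattern[-1] != token:
            pattern.append(token)
    return pattern


def pattern_distance(p: list[str], q: list[str]) -> int:
    """Unit-cost edit distance between two token sequences.

    This is a simple global alignment (insert, delete, substitute all cost 1),
    used here as a stand-in for the paper's alignment-based distance.
    """
    prev = list(range(len(q) + 1))
    for i, a in enumerate(p, start=1):
        curr = [i]
        for j, b in enumerate(q, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def representatives(
    examples: list[tuple[str, str]], threshold: int = 0
) -> list[tuple[str, str]]:
    """Greedily keep one example per distinct input pattern.

    An example is kept only if its input pattern is farther than
    `threshold` from the input pattern of every example kept so far.
    """
    kept: list[tuple[str, str]] = []
    for inp, out in examples:
        pat = to_pattern(inp)
        if all(pattern_distance(pat, to_pattern(k)) > threshold for k, _ in kept):
            kept.append((inp, out))
    return kept
```

With the examples `("2022-03-15", "15/03/2022")`, `("2021-12-01", "01/12/2021")`, and `("15 Mar 2022", "15/03/2022")`, the first two inputs share one pattern, so this sketch keeps only the first and third. Raising `threshold` would merge patterns that differ by only a few tokens.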

 
