Research on Missing Data Processing Methods in Web Data Integration

Source: Hans Journal of Data Mining (《数据挖掘》), 2021, No. 4, pp. 226-240 (15 pages)

Abstract: Data preprocessing is an important step in Web data integration, and repairing missing data is an important part of data preprocessing. The key difficulty in repairing missing data in Web data integration is that the missing points have no observed values that can directly serve as a reference. As a result, estimation and inference methods cannot be applied, and the data can only be filled in using rules formulated by experienced users or domain experts. For a large database with thousands of missing points, however, it is not feasible for users to understand the data and formulate effective filling rules, because they would need to know which candidate subsets give the greatest filling probability and coverage for the missing points, and recommending such candidate subsets to the user is computationally very expensive. To address this problem, this paper proposes an algorithm for generating candidate subsets based on information entropy. Starting from the user's edits to an initial set of candidate subsets, the algorithm computes the candidate subsets with the greatest filling probability and coverage of missing points. Based on the candidate subsets selected by the user and the one-to-many association relations in the data set, rules with higher coverage of missing points are generated and recommended, and the rules selected by the user are then generalized to more missing points through the same one-to-many association relations. Results from a prototype system show that data repaired by this method has high accuracy. Experiments also show that ordinary users can repair a large amount of missing data in a short time, which effectively improves data repair.
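The abstract does not give the details of the entropy-based algorithm, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes that a "candidate subset" is a set of attributes, that "filling probability" can be approximated by a low conditional entropy of the target attribute given that subset, and that "coverage" is the fraction of missing points whose subset values also occur in a complete record. All function names, attribute names, and sample records are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def conditional_entropy(rows, subset, target):
    """H(target | subset) in bits, estimated from rows where target is observed."""
    groups = defaultdict(Counter)
    for r in rows:
        if r[target] is not None:
            groups[tuple(r[a] for a in subset)][r[target]] += 1
    total = sum(sum(c.values()) for c in groups.values())
    h = 0.0
    for counts in groups.values():
        n = sum(counts.values())
        for k in counts.values():
            h -= (n / total) * (k / n) * math.log2(k / n)
    return h


def coverage(rows, subset, target):
    """Fraction of missing points whose subset values appear in a complete row."""
    seen = {tuple(r[a] for a in subset) for r in rows if r[target] is not None}
    missing = [r for r in rows if r[target] is None]
    if not missing:
        return 0.0
    hits = sum(1 for r in missing if tuple(r[a] for a in subset) in seen)
    return hits / len(missing)


def rank_candidate_subsets(rows, attributes, target, max_size=2):
    """Rank attribute subsets: lowest entropy first, then highest coverage."""
    scored = []
    for size in range(1, max_size + 1):
        for subset in combinations(attributes, size):
            scored.append((subset,
                           conditional_entropy(rows, subset, target),
                           coverage(rows, subset, target)))
    scored.sort(key=lambda s: (s[1], -s[2]))
    return scored


# Hypothetical records: "province" is missing (None) in the last two rows.
sample_rows = [
    {"city": "Beijing", "zip": "100000", "province": "Beijing"},
    {"city": "Shenyang", "zip": "110000", "province": "Liaoning"},
    {"city": "Dalian", "zip": "116000", "province": "Liaoning"},
    {"city": "Shenyang", "zip": "110001", "province": None},
    {"city": "Dalian", "zip": "116001", "province": None},
]

for subset, h, cov in rank_candidate_subsets(sample_rows, ["city", "zip"], "province"):
    print(subset, round(h, 3), cov)
```

In this toy data both `city` and `zip` determine `province` exactly among the complete records, but only `city` values recur in the rows with missing points, so the `city` subset ranks first. This is the kind of trade-off between filling probability and coverage that the abstract describes.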

Keywords: data preprocessing; Web data integration; candidate subset; missing point; information entropy

Classification code: TP3 [Automation and Computer Technology: Computer Science and Technology]
