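Authors: Yuan Huiying (袁辉英), Li Gui (李贵)[1], Li Zhengyu (李征宇)[1], Han Ziyang (韩子扬)[1], Cao Keyan (曹科研)
Affiliation: [1] School of Information and Control Engineering, Shenyang Jianzhu University, Shenyang, Liaoning
Source: Hans Journal of Data Mining (《数据挖掘》), 2021, No. 4, pp. 226-240 (15 pages)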
Abstract: Data preprocessing is an important step in web data integration, and repairing missing data is an important part of data preprocessing. The key problem in repairing missing data during web data integration is that the missing points have no observed values that can serve directly as a reference. As a result, estimation and inference methods cannot be applied, and the data can only be filled in by rules formulated by experienced users or domain experts. However, for a large database with thousands of missing points, it is not feasible for users to understand the data and formulate effective filling rules, because doing so requires knowing which candidate subsets give the highest filling probability and coverage for the missing points, and recommending such candidate subsets to the user is computationally very expensive. To address this problem, this paper proposes an information-entropy-based algorithm for generating candidate subsets: after the user edits the initial candidate subsets, the algorithm computes the candidate subsets with the highest filling probability and coverage of missing points. Based on the candidate subset selected by the user and the one-to-many association relations in the data set, rules with higher coverage of missing points are generated and recommended, and the rules the user selects are generalized to more missing points through the same one-to-many association relations. Results from a prototype implementation show that data repaired by this method has high accuracy, and experiments show that ordinary users can repair a large amount of missing data in a short time, which effectively improves data repair.
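The abstract does not give the algorithm itself, so the following is only a minimal sketch of how candidate attribute subsets might be ranked by an entropy-style score and by coverage. Everything in it is an assumption made for illustration: the function names, the row-as-dictionary layout, the use of conditional entropy as a stand-in for "filling probability", and the definition of coverage as the share of missing points whose subset values also occur in complete records. It should not be read as the authors' method.

```python
from collections import Counter, defaultdict
from math import log2


def conditional_entropy(rows, subset, target):
    """Estimate H(target | subset) in bits from rows whose target is known.

    A lower value means the subset determines the target more reliably
    (used here as a hypothetical proxy for a high filling probability).
    """
    groups = defaultdict(Counter)
    for row in rows:
        if row[target] is not None:
            key = tuple(row[a] for a in subset)
            groups[key][row[target]] += 1
    total = sum(sum(c.values()) for c in groups.values())
    if total == 0:
        return 0.0
    h = 0.0
    for counts in groups.values():
        n = sum(counts.values())
        for k in counts.values():
            h -= (n / total) * (k / n) * log2(k / n)
    return h


def coverage(rows, subset, target):
    """Fraction of missing points whose subset values appear in complete rows."""
    known = {tuple(r[a] for a in subset) for r in rows if r[target] is not None}
    missing = [r for r in rows if r[target] is None]
    if not missing:
        return 0.0
    hits = sum(1 for r in missing if tuple(r[a] for a in subset) in known)
    return hits / len(missing)


def rank_candidates(rows, candidates, target):
    """Order candidate subsets by low entropy first, then by high coverage."""
    scored = [
        (s, conditional_entropy(rows, s, target), coverage(rows, s, target))
        for s in candidates
    ]
    return sorted(scored, key=lambda item: (item[1], -item[2]))
```

Under these assumptions, a subset such as `("city",)` would rank above `("zip",)` for filling a missing `province` when every city maps to one province and the cities of the incomplete rows also appear in complete rows, whereas the zip codes of the incomplete rows do not.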
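Keywords: data preprocessing; web data integration; candidate subset; missing point; information entropy
Classification: TP3 [Automation and Computer Technology: Computer Science and Technology]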