Authors: 崔晨[1], 李贵[1], 李征宇[1], 韩子扬[1], 曹科研
Source: Hans Journal of Data Mining (《数据挖掘》), 2021, No. 3, pp. 150-166 (17 pages)
Abstract: Anomalies are noisy and inconsistent values in data. Detecting and repairing them often relies on constraint rules, which typically include conditional functional dependencies, denial constraints, and editing rules. In a specific domain, however, such rules must be written by domain experts, and they are difficult to discover efficiently with data mining and machine learning algorithms. This paper proposes the concept of an abnormal itemset for data cleaning: similar in spirit to outlier detection based on data distribution density, an abnormal itemset is a combination of values that is unlikely to occur in the data. On this basis, the paper introduces the concept and properties of weighted harmonic lift and uses an improved equivalence class transformation (Eclat) algorithm to mine abnormal itemsets with low lift. Quasi-abnormal itemsets are used to precompute data corrections, yielding an anomaly repair algorithm, similar to nearest-neighbor imputation, that safeguards repair quality. Experiments on a real estate information data set show that anomaly detection and repair based on abnormal itemsets achieves high accuracy and avoids introducing new anomalies during data repair.
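The idea of a low-lift value combination can be illustrated with a minimal sketch. It rests on several assumptions: it uses the standard lift from association rule mining, lift(a, b) = P(a, b) / (P(a) P(b)), rather than the paper's weighted harmonic lift (whose definition is not given in the abstract); it counts pairs by brute force rather than using the improved Eclat algorithm; and the function name `low_lift_pairs`, its thresholds, and the sample rows are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def low_lift_pairs(rows, max_lift=0.5, min_item_support=2):
    """Return {(item_a, item_b): lift} for co-occurring pairs with lift <= max_lift.

    Each row is a dict mapping attribute name to value; an item is an
    (attribute, value) tuple. Only pairs that actually occur in the data
    are considered, since the goal is to flag records, not absent combinations.
    """
    n = len(rows)
    item_counts = Counter()
    pair_counts = Counter()
    for row in rows:
        # Sorting by attribute name gives every pair a canonical order.
        items = sorted(row.items())
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    flagged = {}
    for (a, b), count in pair_counts.items():
        # Skip pairs built from very rare items: their lift is unreliable.
        if item_counts[a] < min_item_support or item_counts[b] < min_item_support:
            continue
        lift = (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        if lift <= max_lift:
            flagged[(a, b)] = lift
    return flagged


# Hypothetical toy data: a high-rise villa is the unusual combination.
rows = (
    [{"type": "apartment", "floors": "high"}] * 5
    + [{"type": "villa", "floors": "low"}] * 4
    + [{"type": "villa", "floors": "high"}]
)
flagged = low_lift_pairs(rows)
```

In this toy data only the pair (floors = high, type = villa) falls below the threshold, because it co-occurs far less often than the individual frequencies of its items would suggest. A repair step in the spirit of the abstract would then replace one of the flagged values using similar records, which is not shown here.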
Classification number: TP3 [Automation and Computer Technology: Computer Science and Technology]