Authors: GAO Fei (高菲); SONG Shao-Xu (宋韶旭) [1,2,3]; WANG Jian-Min (王建民) (School of Software, Tsinghua University, Beijing 100084, China; National Engineering Laboratory for Big Data Software, Beijing 100084, China; Beijing National Research Center for Information Science and Technology (Tsinghua University), Beijing 100084, China)
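Affiliations: [1] School of Software, Tsinghua University, Beijing 100084; [2] National Engineering Laboratory for Big Data Software, Beijing 100084; [3] Beijing National Research Center for Information Science and Technology (Tsinghua University), Beijing 100084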
Source: Journal of Software (《软件学报》), 2021, No. 3, pp. 689-711 (23 pages)
Funding: National Key Research and Development Program of China (2019YFB1705301); National Natural Science Foundation of China (62072265, 61572272, 71690231).
Abstract: Data quality underpins data management and analysis, and as big data and artificial intelligence technologies are further optimized and adopted, it has increasingly become a research hotspot in related fields. Physical failures or technical defects in data collection and recording devices often introduce errors into the collected data. These anomalies can significantly affect subsequent data analysis and artificial intelligence processes, so data must be cleaned and repaired before use. Existing smoothing-based repair methods over-repair a large number of originally correct data points into wrong values, while constraint-based methods such as sequential dependencies and SCREEN cannot accurately repair data under complex conditions because their constraints are relatively simple. Based on the minimum repair principle, this paper proposes a time series data repair method under multi-interval speed constraints and uses dynamic programming to find the optimal repair path. Specifically, multiple speed intervals are used to constrain the time series data, a set of repair candidates is generated for each data point according to these speed constraints, and the optimal repair is then selected from the candidates by dynamic programming. To verify feasibility and effectiveness, experiments are run on one synthetic data set, two real data sets, and one real data set with real errors, under different anomaly rates and data sizes. The results show that, compared with existing smoothing-based and constraint-based methods, the proposed method performs better in terms of RMS error and time cost. In addition, clustering and classification accuracy on several data sets shows that data quality has a substantial impact on subsequent data analysis and artificial intelligence, and that the proposed method can improve the quality of data analysis and artificial intelligence results.
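The abstract's pipeline (multiple speed intervals, per-point repair candidates, dynamic programming over the candidates) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function names `speed_ok` and `repair`, the candidate rule (extrapolating from every original point at each interval endpoint), the L1 repair cost, and the restriction to consecutive points are all assumptions made for illustration, and the sketch favours clarity over efficiency. It also assumes strictly increasing timestamps.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]


def speed_ok(speed: float, intervals: Sequence[Interval], eps: float = 1e-9) -> bool:
    """True if the speed lies in at least one allowed interval (hypothetical helper)."""
    return any(lo - eps <= speed <= hi + eps for lo, hi in intervals)


def repair(t: Sequence[float], x: Sequence[float],
           intervals: Sequence[Interval]) -> Tuple[List[float], float]:
    """Return (repaired values, total L1 change) under the assumed candidate rule."""
    n = len(x)
    if n == 0:
        return [], 0.0
    # Assumed candidate rule: extrapolate from every original point j at every
    # interval endpoint s, and always keep the observed value itself.
    bounds = sorted({s for iv in intervals for s in iv})
    cands = [
        sorted({x[j] + s * (t[i] - t[j]) for j in range(n) for s in bounds} | {x[i]})
        for i in range(n)
    ]
    inf = float("inf")
    # cost[i][k]: cheapest total change for points 0..i if point i takes cands[i][k].
    cost = [[abs(c - x[0]) for c in cands[0]]]
    back = [[-1] * len(cands[0])]
    for i in range(1, n):
        dt = t[i] - t[i - 1]  # assumes strictly increasing timestamps
        row_cost, row_back = [], []
        for c in cands[i]:
            best, arg = inf, -1
            for k, p in enumerate(cands[i - 1]):
                # Only transitions whose speed falls in some interval are allowed.
                if cost[i - 1][k] < best and speed_ok((c - p) / dt, intervals):
                    best, arg = cost[i - 1][k], k
            row_cost.append(best + abs(c - x[i]))
            row_back.append(arg)
        cost.append(row_cost)
        back.append(row_back)
    k = min(range(len(cands[-1])), key=lambda q: cost[-1][q])
    total = cost[-1][k]
    if total == inf:
        raise ValueError("no repair satisfies the speed constraints")
    # Walk the back-pointers to recover the chosen candidate at each point.
    repaired = [0.0] * n
    for i in range(n - 1, -1, -1):
        repaired[i] = cands[i][k]
        k = back[i][k]
    return repaired, total
```

For instance, `repair([0, 1, 2, 3, 4], [0, 0, 5, 5, 5], [(-1, 1), (4, 6)])` leaves the series unchanged, because the jump of 5 falls inside the second interval, whereas the single interval `[(-1, 1)]` forces a repair. That contrast is the motivation the abstract gives for using several speed intervals rather than one.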
Classification number: TP311 [Automation and Computer Technology: Computer Software and Theory]