面向K-近邻学习模型的高效数据清洗框架  被引量:3

Efficient Data Cleaning Framework for K-Nearest Neighbor Learning Models

在线阅读下载全文

作  者:王婧怡 陈胤佳 袁野 陈辰 王国仁 WANG Jingyi;CHEN Yinjia;YUAN Ye;CHEN Chen;WANG Guoren(School of Computer Science&Technology,Beijing Institute of Technology,Beijing 100081,China;School of Computer Science&Engineering,Beihang University,Beijing 100191,China)

机构地区:[1]北京理工大学计算机学院,北京100081 [2]北京航空航天大学计算机学院,北京100191

出  处:《计算机科学与探索》2023年第9期2241-2251,共11页Journal of Frontiers of Computer Science and Technology

基  金:国家自然科学基金(61932004,61732003,U2001211)。

摘  要:现实世界中收集的数据集通常是含有缺失的,为了在不完备数据集上构建有效的机器学习模型,需要对数据集进行清洗。为了确保较好的清洗效果,通常需要人工参与,从而导致大量成本。确定不完备数据的清洗优先级将有助于减小清洗规模,节约人工成本。而计算不完备数据的清洗优先级应确定其对模型性能的贡献。夏普利值是目前流行的用来评估数据在机器学习模型中贡献的方法,因此可以借助夏普利值的概念计算不完备数据的清洗优先级。由于现有工作缺少对不完备数据夏普利值的研究,首先基于不完备数据集的指数级的所有可能世界定义了一种不完备数据夏普利值的表示方法;然后基于K-近邻分类模型的效用函数,提出了一种多项式时间内计算不完备数据在K-近邻分类模型中夏普利值的近似算法;最后提出了一种基于夏普利值的面向K-近邻分类模型的启发式数据清洗算法ShapClean。实验表明,该算法在清洗后模型分类准确率方面往往可以明显超过现有的针对机器学习模型的自动清洗算法,而且相比同样需要人工参与的数据清洗算法,该方法具有更高的清洗效率,可以有效节约人工成本,同时保证理想的模型准确度。Real-world datasets are often collected with missing data,and in order to build effective machine learning models on incomplete datasets,the datasets need to be cleaned.To ensure the quality of the cleaned datasets,human involvement is often required,which incurs considerable costs.Prioritizing the cleaning of incomplete data will help minimize cleaning scale and save labor costs.Calculating the priority needs determining the contribution of the incomplete data to the performance of the model.Shapley value is a popular method for evaluating the contribution,so it can be used to calculate the cleaning priority.Due to the lack of existing work on Shapley value of incomplete data,a representation of Shapley value of incomplete data is firstly defined based on the possible worlds of the datasets.And an approximation algorithm for calculating Shapley value of incomplete data in the K-nearest neighbor classification model in polynomial time is proposed based on the K-nearest neighbor utility.Finally,the ShapClean,a heuristic data cleaning algorithm based on Shapley value,is proposed.Experiments show that the algorithm can often significantly exceed the existing automatic cleaning algorithms in terms of the accuracy.And compared with data cleaning algorithms that also require human involvement,the ShapClean can save more labour costs while ensuring the desired model accuracy.

关 键 词:不完备数据集 夏普利值 K-近邻(KNN) 清洗优先级 数据清洗 

分 类 号:TP399[自动化与计算机技术—计算机应用技术]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象