Authors: SHAO Liangshan [1,2]; ZHAO Songze
Affiliations: [1] School of Software, Liaoning Technical University, Huludao 125105, Liaoning, China; [2] Institute of Systems Engineering, Liaoning Technical University, Huludao 125105, Liaoning, China
Source: Computer Engineering (《计算机工程》), 2023, No. 9, pp. 79-88, 98 (11 pages)
Funding: National Natural Science Foundation of China (71771111).
Abstract: Missing data imputation is an important step in data mining from incomplete datasets. Existing imputation algorithms cannot make effective use of samples with high missing rates; they also treat samples with different missing rates as equivalent and assume that missing data and complete data follow the same distribution. To address these problems, an incomplete-data fractional imputation algorithm based on multi-model fusion, FIB, is constructed. Drawing on noisy label learning, a new sample scoring method is proposed that outputs a score for each sample. A machine learning model then uses this score as the sample weight, which reduces the impact of unreliable samples on model performance. Borrowing from pseudo-label techniques, samples with high missing rates are used to generate pseudo-label data. The pseudo-label data are added to the imputation results to form the unit imputation results to be merged, and the unit imputation results produced by multiple imputation algorithms are fused into the final imputation result. Experiments on 12 public UCI datasets show that, relative to traditional imputation algorithms, the three new techniques (sample scoring, pseudo-label data generation, and multi-model fusion) yield average relative improvements in imputation quality of 2.35%, 5.89%, and 7.78%, respectively. Compared with DIM, FIB achieves an average relative accuracy improvement of 8.39%. In addition, imputation quality improves as the number of fused models increases: for classification tasks, fusing five models gives an average relative accuracy improvement of 11% over fusing two models, and for regression tasks the R² score shows an average relative improvement of 15%.
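The abstract does not give the scoring function or the fusion rule, so the following is only a minimal sketch of two of the ideas it names: weighting samples by a reliability score and fusing the outputs of several imputers. Everything in it is an assumption made for illustration, not the authors' FIB algorithm: the score is simply the fraction of observed entries in a row, the two imputers (`weighted_mean_imputer`, `median_imputer`) are hypothetical stand-ins, and fusion is a plain average. Pseudo-label generation is not shown.

```python
from statistics import mean, median
from typing import Callable, List, Optional

Row = List[Optional[float]]          # None marks a missing entry
Imputer = Callable[[List[Row], int], float]


def sample_score(row: Row) -> float:
    """Hypothetical reliability score: fraction of observed entries in the row."""
    return sum(v is not None for v in row) / len(row)


def weighted_mean_imputer(data: List[Row], col: int) -> float:
    """Estimate a column value, weighting each observed value by its row's score."""
    pairs = [(r[col], sample_score(r)) for r in data if r[col] is not None]
    total = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total


def median_imputer(data: List[Row], col: int) -> float:
    """Estimate a column value as the unweighted median of observed values."""
    return median(r[col] for r in data if r[col] is not None)


def fuse_impute(data: List[Row], imputers: List[Imputer]) -> List[Row]:
    """Fill each missing entry with the average of all imputers' estimates.

    Assumes every column has at least one observed value.
    Returns a new table; the input is left unchanged.
    """
    n_cols = len(data[0])
    fused = [mean(imp(data, c) for imp in imputers) for c in range(n_cols)]
    return [[fused[c] if v is None else v for c, v in enumerate(row)]
            for row in data]


if __name__ == "__main__":
    table = [[1.0, 2.0], [3.0, None], [None, 6.0], [5.0, 4.0]]
    print(fuse_impute(table, [weighted_mean_imputer, median_imputer]))
```

In this toy version, rows with more missing entries contribute less to the weighted estimate, and adding more imputers to the list changes only the averaging step.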
Keywords: missing data imputation; multi-model fusion; pseudo-label; noisy label learning; data mining
Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]