Fractional Imputation Algorithm for Incomplete Data Based on Multi-Model Fusion

Original title: 基于多模型融合的不完整数据分数插补算法 (cited by: 1)

Authors: SHAO Liangshan (邵良杉) [1,2]; ZHAO Songze (赵松泽)

Affiliations: [1] School of Software, Liaoning Technical University, Huludao 125105, Liaoning, China; [2] Institute of Systems Engineering, Liaoning Technical University, Huludao 125105, Liaoning, China

Source: Computer Engineering (《计算机工程》), 2023, No. 9, pp. 79-88, 98 (11 pages in total)

Funding: National Natural Science Foundation of China (71771111).

Abstract: Missing data imputation is an important step in mining incomplete datasets. Existing imputation algorithms cannot make effective use of samples with high missing rates. They also treat samples with different missing rates equally and assume that missing and complete data follow the same distribution. To address these problems, a fractional imputation algorithm for incomplete data based on multi-model fusion, FIB, is constructed. Drawing on noisy-label learning, a new sample scoring method is proposed that outputs a score for each sample. This score is used as the sample weight when the machine learning models are built, which reduces the impact of unreliable samples on model performance. Borrowing from pseudo-label techniques, samples with high missing rates are used to generate pseudo-label data. The pseudo-label data are added to the imputation results to form the unit imputation results to be merged, and the unit imputation results from multiple imputation algorithms are fused to produce the final imputation result. Experiments on 12 public UCI datasets show that, relative to traditional imputation algorithms, the three new techniques (sample scoring, pseudo-label data generation, and multi-model fusion) yield average relative improvements in imputation performance of 2.35%, 5.89%, and 7.78%, respectively. Compared with DIM, FIB achieves an average relative improvement in accuracy of 8.39%. In addition, imputation performance improves as the number of models increases: for classification tasks, fusing five models gives an average relative improvement in accuracy of 11% over fusing two models, and for regression tasks the R² score improves by 15% on average in relative terms.
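The abstract does not give FIB's formulas, so the following is only a minimal sketch of two of the ideas it names: weighting samples by a reliability score, and fusing the outputs of several imputers. Everything in it is an assumption made for illustration. The score is simply the fraction of observed entries in a row (not the noisy-label-based score of the paper), the three imputers are simple column statistics, the fusion is a plain average, and the pseudo-label step is omitted. All function names are hypothetical.

```python
from statistics import mean, median


def completeness_score(row):
    """Assumed stand-in for the sample score: fraction of observed entries."""
    return sum(v is not None for v in row) / len(row)


def mean_imputer(rows, col, weights):
    """Unweighted mean of the observed values in a column (ignores weights)."""
    return mean(r[col] for r in rows if r[col] is not None)


def median_imputer(rows, col, weights):
    """Median of the observed values in a column (ignores weights)."""
    return median(r[col] for r in rows if r[col] is not None)


def weighted_mean_imputer(rows, col, weights):
    """Mean of the observed values, weighted by each row's sample score."""
    pairs = [(r[col], w) for r, w in zip(rows, weights) if r[col] is not None]
    return sum(v * w for v, w in pairs) / sum(w for _, w in pairs)


def fuse_impute(rows, imputers):
    """Fill each missing entry with the average of all imputers' estimates.

    Averaging is an assumed fusion rule, not the one used by FIB.
    """
    weights = [completeness_score(r) for r in rows]
    n_cols = len(rows[0])
    fill = [mean(imp(rows, c, weights) for imp in imputers) for c in range(n_cols)]
    return [[fill[c] if v is None else v for c, v in enumerate(r)] for r in rows]


# Hypothetical toy data: None marks a missing value.
data = [
    [0.0, 1.0, 1.0],
    [10.0, None, 1.0],
    [None, 1.0, 1.0],
]
imputers = [mean_imputer, median_imputer, weighted_mean_imputer]
completed = fuse_impute(data, imputers)
```

In this toy setup, rows with more missing entries receive a lower weight, so they pull the weighted estimate less than complete rows do. Observed values are never overwritten.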

Keywords: missing data imputation; multi-model fusion; pseudo-label; noisy-label learning; data mining

Classification code: TP391 (Automation and Computer Technology: Computer Application Technology)
