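Affiliations: [1] College of Information and Management Science, Henan Agricultural University, Zhengzhou 450046, Henan, China; [2] NOVA Information Management School, Universidade Nova de Lisboa, Campus de Campolide, 1070-312 Lisbon, Portugal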
Source: Journal of Henan Agricultural University, 2025, No. 2, pp. 316-325 (10 pages)
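Funding: Henan Province Science and Technology R&D Program Joint Fund, Applied Research category (242103810028); Henan Province Science and Technology Research Project (252102520037); Henan Province Key R&D Special Project (251111211300, 231111110100, 231111211300); Henan Province Central Government Guided Local Science and Technology Development Project (Z20231811005).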
Abstract:
【Objective】To design a method for imputing missing values based on FIML and DAE, termed the clustering-based comprehensive information selective filtering encoder data imputation algorithm (CFSM-DAE), and to use it to fill missing data in rice germplasm resources.
【Method】Clustering is used to limit the impact of outliers on the algorithm, and a selective filtering layer is used to identify high-quality estimates and reduce the influence of low-quality ones. Traditional DAE frameworks typically lack such a layer: they treat all estimates as equally important and cannot distinguish high-quality from low-quality estimates. To further improve estimation accuracy, the study adopts an integrated framework that combines full information maximum likelihood (FIML) with a dual adversarial encoder (DAE). Building on the selective filtering layer, imputation is adaptive: when an estimate does not meet the set threshold, the FIML imputation strategy is used instead, to keep the imputed results stable and accurate. Simulated data and a real rice germplasm resource dataset were studied under three missing-data mechanisms: missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR). CFSM-DAE was compared with several common imputation algorithms: FIML, DAE, K-nearest neighbors (KNN), random forest (RF), and multiple imputation by chained equations (MICE).
【Result】On simulated data, CFSM-DAE achieved S_RME = 0.0676, E_MA = 0.0093, and R^2 = 0.9958; on rice germplasm resource data, it achieved S_RME = 0.0395, E_MA = 0.0078, and R^2 = 0.8913. By comparison, on the same two datasets DAE had S_RME values of 0.8896 and 0.7707, KNN had E_MA values of 0.1183 and 0.1305, and FIML had R^2 values of 0.3382 and 0.7321. CFSM-DAE therefore outperformed the other algorithms on these evaluation metrics for both the simulated data and the rice germplasm resource data.
【Conclusion】By combining clustering, selective filtering, and full information maximum likelihood, CFSM-DAE significantly improved the accuracy of missing-value imputation in rice germplasm resource data, showing its effectiveness and potential for handling complex missing-value problems.
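Keywords: rice germplasm resources; clustering; full information maximum likelihood; adversarial autoencoder; selective filtering layer; missing data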
Classification code: S127 [Agricultural Sciences: Basic Agricultural Sciences]
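
The adaptive fallback described in the abstract (keep an estimate that passes the selective filter, otherwise use the FIML estimate) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how estimate quality is scored or what the threshold is, so `confidence`, `threshold`, and the function name `adaptive_fill` are hypothetical, and the fallback values are simply passed in as a list rather than computed by FIML. The metric helpers assume that S_RME and E_MA denote root mean square error and mean absolute error, which the abstract does not define.

```python
import math


def adaptive_fill(primary, confidence, fallback, threshold=0.5):
    """Keep a primary estimate when its (hypothetical) confidence score
    meets the threshold; otherwise use the fallback estimate."""
    if not (len(primary) == len(confidence) == len(fallback)):
        raise ValueError("all inputs must have the same length")
    return [p if c >= threshold else f
            for p, c, f in zip(primary, confidence, fallback)]


def rmse(y_true, y_pred):
    """Root mean square error (assumed meaning of S_RME)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error (assumed meaning of E_MA)."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, invented for illustration only (not data from the paper).
truth = [1.0, 2.0, 3.0, 4.0]
primary = [1.1, 5.0, 2.9, 4.2]      # e.g. autoencoder estimates
confidence = [0.9, 0.2, 0.8, 0.7]   # hypothetical quality scores
fallback = [1.5, 2.2, 3.5, 3.5]     # e.g. precomputed fallback estimates

filled = adaptive_fill(primary, confidence, fallback, threshold=0.5)
print(filled)
print(rmse(truth, filled), mae(truth, filled), r2(truth, filled))
```

In this toy run only the second estimate falls below the threshold, so it alone is replaced by the fallback value. The point of the design is that a single poor estimate is swapped out without discarding the estimates that passed the filter.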