A new genomic selection algorithm based on FASTmrEMMA, least angle regression and random forest  Cited by: 4


Authors: SUN Jiali; WU Qingtai[1]; WEN Yangjun; ZHANG Jin[1] (College of Sciences, Nanjing Agricultural University, Nanjing 210095, China)

Affiliation: [1] College of Sciences, Nanjing Agricultural University, Nanjing 210095, Jiangsu, China

Source: Journal of Nanjing Agricultural University, 2021, No. 2, pp. 366-372 (7 pages)

Funding: National Natural Science Foundation of China, Young Scientists Fund (31301229).

Abstract: [Objectives] This study applied FASTmrEMMA, least angle regression (LARS) and random forest (RF) to genomic selection in order to improve the accuracy and efficiency of predicting plant quantitative traits and to provide useful information for plant genetics and breeding. [Methods] Genome-wide prediction was carried out on simulated and real datasets from a natural population of Arabidopsis thaliana. In the simulation studies, datasets with different phenotypic missing rates were analyzed with four methods: the two-stage algorithm based on least angle regression and random forest (TSLRF), two-stage stepwise variable selection based on random forest (TSRF), RF, and genomic best linear unbiased prediction (GBLUP). The methods were compared by mean absolute error (MAE), mean squared error (MSE), fit of the prediction model, and running time. In the real data analysis, the missing phenotypes of three traits, days to flowering under long day (LD), days to flowering under long day with vernalization (LDV) and days to flowering under short day (SD), were predicted by the four methods, and the predicted and observed phenotypes were then used together in genome-wide association studies to compare the performance of the methods. [Results] In the simulation studies, TSLRF performed well in prediction accuracy, fit of the prediction model and running time under the various phenotypic missing rates. The real data analyses with TSLRF led to similar conclusions and detected 40 previously reported genes significantly associated with the target traits. [Conclusions] TSLRF has high genome-wide prediction accuracy, good model fit and fast computing speed, and provides a theoretical basis for molecular breeding and the prediction of elite parental combinations.
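The two error measures named in the abstract, MAE and MSE, can be illustrated with a minimal sketch. It is not the authors' code: the function names, the `mask_phenotypes` helper, the rounding rule for the number of hidden phenotypes, and all numeric values are assumptions made only for illustration. Only the Python standard library is used.

```python
import random
from typing import List, Sequence


def mean_absolute_error(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """MAE: average of |observed - predicted| over all individuals."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mean_squared_error(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """MSE: average of (observed - predicted)^2 over all individuals."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def mask_phenotypes(n_individuals: int, missing_rate: float, seed: int = 0) -> List[int]:
    """Hypothetical helper: choose which individuals have their phenotype hidden.

    The number hidden is round(missing_rate * n_individuals); this rounding
    rule is an assumption, not taken from the paper.
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must lie in [0, 1]")
    n_hidden = round(missing_rate * n_individuals)
    return sorted(random.Random(seed).sample(range(n_individuals), n_hidden))


# Made-up flowering times in days, for illustration only.
observed = [20.0, 35.0, 50.0, 65.0]
predicted = [22.0, 33.0, 50.0, 61.0]

print(mean_absolute_error(observed, predicted))  # (2 + 2 + 0 + 4) / 4 = 2.0
print(mean_squared_error(observed, predicted))   # (4 + 4 + 0 + 16) / 4 = 6.0

# Indices whose phenotypes would be hidden at a 30% missing rate.
hidden = mask_phenotypes(n_individuals=10, missing_rate=0.3)
print(len(hidden))  # 3
```

In a simulation of this kind, the errors would be computed only over the hidden individuals, so that each method is scored on phenotypes it did not see during fitting.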

Keywords: FASTmrEMMA; least angle regression; random forest; polygenic effect correction; genomic selection

CLC number: Q943 [Biology: Botany]

 
