Research on the Intercontinental Population Biogeographic Ancestral Inference Model Based on PCA-XGBoost Method


Authors: YAO Hao-Tian (姚昊天); JIANG Li (江丽); WANG Chun-Nian (王春年); FAN Hong (范虹) [1]; LI Cai-Xia (李彩霞) (School of Computer Science, Shaanxi Normal University, Xi'an 710119, China; Key Laboratory of Forensic Genetics, Beijing Engineering Research Center of Crime Scene Evidence Examination, National Engineering Laboratory for Forensic Science, Institute of Forensic Science, Beijing 100038, China)

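Affiliations: [1] School of Computer Science, Shaanxi Normal University, Xi'an 710119, China [2] Key Laboratory of Forensic Genetics, Beijing Engineering Research Center of Crime Scene Evidence Examination, National Engineering Laboratory for Forensic Science, Institute of Forensic Science, Ministry of Public Security, Beijing 100038, China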

Source: Progress in Biochemistry and Biophysics (《生物化学与生物物理进展》), 2024, No. 12, pp. 3292-3309 (18 pages)

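Funding: Supported by the National Key Research and Development Program of China (2022YFC3341004); the National Natural Science Foundation of China (82171870); the Natural Science Foundation of Shaanxi Province (2022ZJ-39); the Open Project of the Key Laboratory of Forensic Genetics, Ministry of Public Security (2023FGKFKT01); and the Fundamental Research Funds of the Institute of Forensic Science, Ministry of Public Security (2022JB020).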

Abstract: Objective Inferring an individual's biogeographical ancestry (BGA) from DNA has attracted wide attention in anthropology, forensic science, and related fields. The common approach uses several dozen ancestry-informative single nucleotide polymorphism (SNP) loci and determines an individual's ancestry with methods such as principal component analysis (PCA) and the likelihood ratio (LR). With the development of high-throughput sequencing, high-density SNP datasets for population samples can now be obtained in bulk with ease, and the introduction of machine learning and related techniques from computer science has opened new directions for BGA research. This study aims to build a BGA inference model that suits high-density SNP data and offers high accuracy and good generalization. Methods First, using data on 307 866 SNPs and the supervised learning model XGBoost, a PCA-XGBoost inference model based on multidimensional principal components (PCs) was constructed. Second, the inference results were evaluated and the model was optimized on the basis of LR, which determined the optimal number of PCs and the number of training rounds. Finally, model performance was further validated on test sets drawn from other public data. Results Under the LR-based evaluation method, the model's population prediction accuracy can exceed 95% on the reference set and 90% on the test sets. Conclusion The PCA-XGBoost model predicts intercontinental populations with high accuracy, and the LR-based evaluation method helps to further assess the reliability of the predictions. The model generalizes well; once the reference population data are replaced, it is expected to support finer-grained population analysis.

Objective The inference of biogeographical ancestry (BGA) using DNA is a significant focus within anthropology and forensic science. Current methods often utilize dozens of ancestry-informative SNPs, employing principal component analysis (PCA) and likelihood ratios (LR) to ascertain individual ancestries. Nonetheless, the selection of these SNPs tends to be population-specific and shows limitations in population differentiation. With the development of high-throughput sequencing technologies, acquiring high-density SNP datasets has become easier, challenging traditional statistical models, which often rely on prior assumptions and struggle with high-density genetic data. The integration of machine learning, which prioritizes data learning and algorithmic iteration over prior knowledge, has driven new developments in BGA research. This study aims to construct a BGA inference model suitable for high-density SNP data, characterized by broad population applicability, higher accuracy, and strong generalization capabilities. Methods Initially, intersection sites of autosomes from the phase III data of the 1000 Genomes Project and commonly used commercial chips were selected to build a reference dataset after thorough site quality control and filtering. This dataset was analyzed using PCA and ADMIXTURE to study population clustering, ancestral component mixing, and genetic substructures. Using spaces formed by different principal component (PC) combinations, this study visually assessed the PCs' capability to differentiate between continental and intercontinental populations. Following this, the study employed the supervised learning classification model XGBoost, establishing a multidimensional PC-based PCA-XGBoost model with hyperparameters set through ten-fold cross-validation and a greedy strategy. Subsequently, the model was optimized and evaluated based on the LR, considering accuracy and runtime to determine the optimal number of PCs and training rounds, culminating in the study's optimal BGA inference model. Finally, the performa
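The abstract names the likelihood ratio (LR) as one of the conventional tools for assigning ancestry from a small SNP panel. As a minimal sketch of that textbook calculation (not the LR-based evaluation procedure of this study, whose details the abstract does not give), the Python below compares the likelihood of one genotype profile under two candidate populations, assuming Hardy-Weinberg proportions and independent loci. The function names, the allele frequencies, and the sample genotypes are all hypothetical and chosen only for illustration.

```python
import math


def genotype_likelihood(alt_count, alt_freq):
    """P(genotype | population) for one biallelic SNP under Hardy-Weinberg."""
    p = alt_freq
    q = 1.0 - p
    if alt_count == 0:
        return q * q
    if alt_count == 1:
        return 2.0 * p * q
    if alt_count == 2:
        return p * p
    raise ValueError("alt_count must be 0, 1, or 2")


def log_likelihood(genotypes, alt_freqs, floor=1e-6):
    """Sum of per-SNP log-likelihoods, assuming the loci are independent."""
    total = 0.0
    for count, freq in zip(genotypes, alt_freqs):
        # Clamp frequencies away from 0 and 1 so that log(0) cannot occur.
        freq = min(max(freq, floor), 1.0 - floor)
        total += math.log(genotype_likelihood(count, freq))
    return total


def log10_lr(genotypes, freqs_a, freqs_b):
    """log10 of P(profile | population A) / P(profile | population B)."""
    diff = log_likelihood(genotypes, freqs_a) - log_likelihood(genotypes, freqs_b)
    return diff / math.log(10)


# Hypothetical alternate-allele frequencies for three SNPs (illustration only).
FREQS_POP_A = [0.9, 0.8, 0.1]
FREQS_POP_B = [0.2, 0.3, 0.7]

# Hypothetical sample: alternate-allele counts (0, 1, or 2) at the same SNPs.
SAMPLE = [2, 1, 0]

if __name__ == "__main__":
    value = log10_lr(SAMPLE, FREQS_POP_A, FREQS_POP_B)
    print(f"log10 LR (A vs B) = {value:.3f}")
```

A positive log10 LR favors population A and a negative one favors population B. With several candidate populations, the same function can be applied pairwise, and a threshold on the smallest pairwise LR can be used to flag low-confidence assignments.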

Keywords: biogeographical ancestry inference; supervised learning; principal component analysis; XGBoost model

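Classification codes: TP312 [Automation and Computer Technology: Computer Software and Theory]; R89 [Automation and Computer Technology: Computer Science and Technology]; D919.2 [Medicine and Health: Forensic Medicine]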

 
