An Algorithm with Base-Pair Resolution for Identifying Cancer Heterogeneity by Estimating Multiple Clonal Haplotypes  Cited by: 2


Authors: GENG Yu (耿彧); ZHAO Zhongmeng (赵仲孟) [2,3]; LIU Jianye (刘建业); XU Jing (许静); CUI Daibing (崔代兵) [2,3]; XIAO Xiao (萧笑); WANG Jiayin (王嘉寅)

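Affiliations: [1] School of General Education, Jinzhou Medical University, Jinzhou, Liaoning 121001, China [2] Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China [3] Shaanxi Engineering Research Center of Medical and Health Big Data, Xi'an Jiaotong University, Xi'an 710049, China [4] State Key Laboratory of Cancer Biology, The Fourth Military Medical University, Xi'an 710032, China [5] School of Management, Xi'an Jiaotong University, Xi'an 710049, China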

Source: Journal of Xi'an Jiaotong University, 2017, No. 6, pp. 92-96 (5 pages)

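Funding: National Natural Science Foundation of China (81400632); Natural Science Foundation of Shaanxi Province (2014JM8350); Fundamental Research Funds for the Central Universities (GLIJ002)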

Abstract: To resolve the sub-clonal structure of heterogeneous tumor tissue, an algorithm is proposed that identifies haplotype heterogeneity from the somatic mutation patterns carried by multiple levels of sub-clones. The algorithm works on multi-library sequencing data from tumor tissue and extracts both library features and paired-end read constraints. A prior on the number of sub-clones is estimated by clustering the variant allele frequencies of the somatic mutation loci. A contig-and-extension algorithm then assembles haplotype sequences by traversing the reads that map to each locus, so the assembled haplotypes have base-pair resolution. The number of sub-clones, their proportions, and the evolutionary relationships among them are obtained by maximum likelihood estimation over the posterior probabilities. Simulation experiments show that, when the base library reaches a certain level of sequencing coverage, the algorithm identifies haplotype heterogeneity with an accuracy above 99%, and that it can replace the two-step pipeline commonly used in current data analysis while producing highly accurate results.
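The abstract says only that a prior on the number of sub-clones comes from clustering the variant allele frequencies (VAFs) of somatic loci; it does not name the clustering method. The sketch below illustrates the idea with a simple gap-based grouping of sorted VAFs, which is an assumption and not necessarily the authors' procedure. The function names, the `max_gap` threshold, and the `min_size` filter are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def cluster_vafs(vafs: List[float], max_gap: float = 0.05) -> List[List[float]]:
    """Group VAFs in sorted order; a gap larger than max_gap starts a new cluster.

    Assumption: gap-based 1-D grouping stands in for the unspecified
    clustering step described in the abstract.
    """
    clusters: List[List[float]] = []
    for v in sorted(vafs):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)   # close enough to the previous value
        else:
            clusters.append([v])     # gap too large: open a new cluster
    return clusters


def estimate_subclone_prior(
    vafs: List[float], max_gap: float = 0.05, min_size: int = 2
) -> Tuple[int, List[float]]:
    """Return (number of clusters, cluster mean VAFs) as a rough prior.

    Clusters with fewer than min_size loci are dropped as likely noise
    (an illustrative choice, not taken from the source).
    """
    kept = [c for c in cluster_vafs(vafs, max_gap) if len(c) >= min_size]
    centers = [sum(c) / len(c) for c in kept]
    return len(kept), centers


# Toy VAFs (made-up values, not data from the paper).
toy_vafs = [0.10, 0.12, 0.11, 0.30, 0.31, 0.29, 0.48, 0.50, 0.49, 0.80]
k, centers = estimate_subclone_prior(toy_vafs)
print(k)  # 3
```

In this toy input the isolated value 0.80 forms a singleton cluster and is discarded by `min_size`, so the estimated prior is three sub-clones. In the described algorithm this count is only a starting point; the final number, proportions, and evolutionary relationships come from the later likelihood-based step.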

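Keywords: tumor heterogeneity; sub-clone resolution; haplotype heterogeneity; multi-library sequencing data; contig-and-extension algorithm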

Classification No.: TP399 [Automation and Computer Technology: Computer Application Technology]

 
