Challenges of High-Throughput Computing in Genomic Data Analysis for Large-Scale Cohort Studies (高通量计算在大规模人群队列基因组数据解析应用中的挑战)  Cited by: 1

Authors: Zeng Jingyao (曾瀞瑶); Yuan Na (苑娜); Wei Wenjuan (魏文娟); Li Gen (李根); Du Zhenglin (杜政霖) (National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China; Genetalks Biotechnology Co., Ltd., Changsha, Hunan 410152, China)

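Affiliations: [1] National Genomics Data Center, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, China; [2] Genetalks Biotechnology Co., Ltd., Changsha, Hunan 410152, China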

Source: Frontiers of Data & Computing (《数据与计算发展前沿》), 2020, Issue 1, pp. 117-127 (11 pages)

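Funding: National Key R&D Program of China, "Construction of a Precision Medicine Knowledge Base for Disease Research" (2016YFC0901900); National Key R&D Program of China, "Biomedical Application Service Community Based on the National High-Performance Computing Environment" (2016YFB0201700).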

Abstract: [Objective] To advance precision medicine research, countries around the world have launched large-scale population cohort genome sequencing projects that build population-specific maps of genomic variation by whole-genome sequencing of tens of thousands of individuals. The resulting volume of genomic data places new demands on computing speed and throughput, and faster, higher-throughput computing platforms are urgently needed to process and interpret this biological sequence information. Because of the characteristics of genomic data and the diversity and complexity of analysis workflows, high-throughput computing (HTC) resources are used inefficiently in large-scale population variant analysis: computation is slow and time-consuming, and data exchange between servers and local machines is inconvenient. Variant analysis therefore needs to be optimized on several fronts, through both software and hardware development. This paper analyzes and reviews these optimization methods.

[Methods] In an HTC system, the system I/O bottleneck is the main cause of low parallelization efficiency in genomic variant analysis. Distributed unstructured storage databases and object storage systems are commonly used to improve the large-scale scalability of I/O and to resolve I/O problems in the analysis workflow. Efficient compression algorithms for genomic data further reduce the I/O and data-transfer load. To accelerate genomic data analysis, algorithms such as neural networks can be used on the software side to optimize analysis methods, and heterogeneous computing with FPGAs (field-programmable gate arrays) or GPUs can be used on the hardware side to increase processing speed.

[Results] Taken together, these optimizations can substantially improve HTC performance in genomic data analysis, address the storage (I/O) wall in genomic data processing, raise the utilization efficiency of HTC resources, and greatly reduce the computing time of whole-genome variant analysis.

[Conclusions] The problems that HTC faces in genomic data analysis can be addressed through software and hardware development and optimization. Doing so significantly improves the computational efficiency of HTC in variant analysis for large-scale population cohorts and supports the broader adoption of cohort-based genomic research and applications.

Keywords: high-throughput computing; I/O performance; genomic variant analysis; heterogeneous acceleration; data compression

Classification code: Q811.4 [Biology: Bioengineering]

 
