Authors: Wang Wendi [1,2], Tang Wen [1,2], Duan Bo [1,2], Zhang Chunming [1], Zhang Peiheng [1], Sun Ninghui [1,3]
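Affiliations: [1] High Performance Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; [2] University of Chinese Academy of Sciences, Beijing 100049; [3] State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190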
Source: Journal of Computer Research and Development (《计算机研究与发展》), 2013, Issue 11, pp. 2463-2471 (9 pages)
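Funding: National Basic Research Program of China (973 Program) (2012CB316502); National High Technology Research and Development Program of China (863 Program) (2009AA01A129); Major Project of the Knowledge Innovation Program of the Chinese Academy of Sciences (KGCX1-YW-13); National Natural Science Foundation of China (60803030, 60633040, 60925009, 60921002)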
Abstract (translated from the Chinese): In recent years, the rapid development of high-throughput gene sequencing technology has substantially reduced both sequencing cost and turnaround time. However, the massive data-generation capacity of next-generation sequencing and the high concurrency inherent in sequencing algorithms pose new challenges to the computing power of existing computers. Taking PerM, an open-source re-sequencing program based on a hash-index algorithm, as an example, the paper studies key techniques for accelerating the application on commercial multi-core CPUs. Experiments on a 64-core SMP system show that the proposed optimizations reduce the cache miss rate by 90% and improve performance by 4 to 11 times. The paper then discusses the design and implementation of a dedicated parallel acceleration system on an accelerator card containing a Xilinx LX330 FPGA. As a prototype, a systolic-array parallel computing system with 11 processing elements was designed and implemented on the FPGA-based PCIe accelerator card. Compared with an 8-core Intel Xeon X7550 CPU, the proposed parallel accelerator has a 30 to 65 times advantage in performance per watt.
Abstract (published English version): In recent years, due to the rapid development of high-throughput next generation sequencing (NGS) technologies, the sequencing cost and time have been greatly reduced. However, both the explosion of the generated NGS data and the massively parallel computation pose great challenges to the capability of existing computers. We take an open-source re-sequencing algorithm based on a hash index, called PerM, as an example to investigate optimizations for accelerating NGS with commercial multi-core CPUs as well as with customized parallel architectures. Firstly, we optimize the original algorithm by reordering the bucket access sequence so that data locality in the shared cache is improved. Secondly, to exclude the empty hash buckets, we propose a hash-index compression algorithm, which matches the sequential access pattern of the optimized algorithm. Experiments on a 64-core SMP (Intel Xeon X7550) show that the optimized algorithm reduces the LLC miss ratio to about 10% of that of the original algorithm, so the overall performance improves by 4 to 11 times. Furthermore, a parallel accelerator architecture is designed and evaluated on our customized FPGA accelerator card with an on-board Xilinx LX330 FPGA. As a prototype, a systolic array of 100 PEs is built, which operates at 175 MHz. The performance of the proposed parallel accelerator architecture is demonstrated by the reported speedup of 30 to 65 times over an 8-core CPU.