Sequence Assembly Method Utilizing HiFi Reads and k-mer Distribution Features


Authors: ZHAI Haixia[1], CAI Wenda, LIU Xiaoyan[1], LUO Junwei[1] (School of Software, Henan Polytechnic University, Jiaozuo 454003, China)

Affiliation: [1] School of Software, Henan Polytechnic University, Jiaozuo 454003, Henan, China

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2024, No. 6, pp. 1376-1383 (8 pages)

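Funding: Supported by the General Program of the National Natural Science Foundation of China (61972134); the Henan Provincial Science and Technology Research Project (192102210118); the Innovative Research Team Project of Henan Polytechnic University (T2021-3); and the Doctoral Fund Project of Henan Polytechnic University (B2018-36).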

Abstract: Sequence assembly recovers the complete genome sequence from the reads (sequence fragments) produced by sequencing technologies. Current third-generation sequencing technologies, such as single-molecule real-time (SMRT) sequencing and Oxford Nanopore Technology (ONT), can generate reads longer than 10 kbp, which helps resolve repetitive regions during assembly. However, their sequencing error rates are as high as 10-15%, so obtaining complete and accurate genome sequences remains challenging. The more recent PacBio HiFi sequencing technology produces reads that are both long (>10 kbp) and highly accurate (>99.9%), which has advanced research in sequence assembly. How to take full advantage of HiFi reads and design efficient assembly methods is a current research focus. alphaHiASM is a new sequence assembly method based on HiFi reads and k-mer distribution features. First, exploiting the long length and high accuracy of HiFi reads, the method uses an overlap detection strategy for HiFi reads based on k-mer distribution features. Second, using the overlaps detected in the first step, it builds a read overlap graph with HiFi reads as nodes, and it proposes an edge weight calculation method that measures the credibility of each overlap. Third, it extracts paths from the overlap graph to form the initial assembly. Finally, the initial assembly is optimized and error-corrected to produce the final result. The method was compared with the popular assemblers Flye, HiCanu and miniasm on four real data sets. The experimental results show that alphaHiASM has some advantages in assembly completeness and correctness.
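The first three steps of the pipeline (overlap detection from shared k-mers, a weighted overlap graph, and path extraction) can be illustrated with a toy sketch. The abstract does not give the actual algorithms, so everything below is an assumption made for illustration: the offset-voting criterion, the edge weight (fraction of k-mers in the overlap that are shared), the greedy traversal, and all function names and parameters are hypothetical and are not alphaHiASM's implementation. The final optimization and error-correction step is omitted, and reads are assumed to be error-free and on the same strand.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each k-mer of seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def detect_overlap(a, b, k=5, min_shared=3):
    """Guess whether a suffix of `a` overlaps a prefix of `b`.

    Hypothetical criterion: every shared k-mer votes for the offset
    (position in a) - (position in b); the best positive offset is
    accepted if it collects at least `min_shared` votes.
    Returns (offset, shared_count) or None.
    """
    index_a = kmer_positions(a, k)
    votes = defaultdict(int)
    for j in range(len(b) - k + 1):
        for i in index_a.get(b[j:j + k], ()):
            if i - j > 0:
                votes[i - j] += 1
    if not votes:
        return None
    offset, shared = max(votes.items(), key=lambda kv: (kv[1], -kv[0]))
    return (offset, shared) if shared >= min_shared else None


def build_overlap_graph(reads, k=5, min_shared=3):
    """Build graph[u][v] = (offset, weight) for each detected overlap u -> v.

    Assumed weight: shared k-mers divided by the number of k-mers
    that fit in the overlapping region (1.0 for a perfect overlap).
    """
    graph = defaultdict(dict)
    for u, a in enumerate(reads):
        for v, b in enumerate(reads):
            if u == v:
                continue
            hit = detect_overlap(a, b, k, min_shared)
            if hit:
                offset, shared = hit
                overlap_len = len(a) - offset
                graph[u][v] = (offset, shared / (overlap_len - k + 1))
    return graph


def greedy_path(graph, n_reads):
    """Start at a read with no incoming edge and follow the heaviest edges."""
    has_incoming = {v for edges in graph.values() for v in edges}
    starts = [u for u in range(n_reads) if u not in has_incoming]
    node = starts[0] if starts else 0
    path, visited = [node], {node}
    while True:
        candidates = [(w, v) for v, (_, w) in graph.get(node, {}).items()
                      if v not in visited]
        if not candidates:
            return path
        _, node = max(candidates)
        path.append(node)
        visited.add(node)


def spell_contig(reads, graph, path):
    """Concatenate the reads along the path, skipping overlapping prefixes."""
    contig = reads[path[0]]
    for u, v in zip(path, path[1:]):
        offset, _ = graph[u][v]
        contig += reads[v][len(reads[u]) - offset:]
    return contig


# Toy usage: three error-free 20 bp reads cut from a made-up 44 bp sequence.
genome = "ACGTTGCAAGGCTTACCGATAGCTGGATCCTAGAGTCAACGTGA"
reads = [genome[12:32], genome[24:44], genome[0:20]]
graph = build_overlap_graph(reads)
path = greedy_path(graph, len(reads))
contig = spell_contig(reads, graph, path)
```

In this toy case the made-up sequence has no repeated 5-mers, so each true overlap receives all of its votes at a single offset and no spurious edges appear. On real HiFi data, repeats and residual errors make overlap scoring and path selection much harder, which is the part the paper's edge weighting is meant to address.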

Keywords: sequence assembly; HiFi reads; third-generation sequencing technology

Classification: TP319 [Automation and Computer Technology: Computer Software and Theory]

 
