Research on a Near-Deduplication Method for Next-Generation Sequencing Data on Cloud Platforms (Cited by: 4)

Near de-duplication method of NGS sequence data oriented cloud platform

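Affiliations: [1] School of Information Management, Beijing Information Science and Technology University, Beijing 100129; [2] Institute of Infectious Diseases, Beijing Ditan Hospital, Capital Medical University, Beijing 100015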

Source: Computer Engineering and Applications, 2017, No. 23, pp. 1-5 (5 pages)

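Funding: National Natural Science Foundation of China (No. 61572079); General Project of the Science and Technology Plan of the Beijing Municipal Education Commission (No. KM201711232018)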

Abstract: Next-generation sequencing (NGS) produces large volumes of data, involves complex processing pipelines, and demands substantial computing resources, so it is commonly processed with cloud computing. Cloud processing, however, requires the sequencing data to be uploaded to the cloud platform first. Because the sequencing process is random, the files produced by sequencing the same sample twice, or by sequencing two similar samples, differ considerably at the binary level. Existing deduplication methods therefore cannot effectively identify such "duplicate" sequencing files or the "duplicate" content within sequencing results. Repeatedly uploading and storing this duplicate data consumes network bandwidth and wastes storage space. Existing deduplication methods rely only on the binary features of files and do not exploit the similarity of sequencing result data. To address this, the paper proposes NPD (Near Probability Deduplication), a near-deduplication method for massive high-throughput sequencing data on cloud platforms. NPD computes SimHash fingerprints over chunks of the sequence and quality information in FASTQ files, uses a pair of cuckoo filters, one on the client and one on the cloud platform, to test quickly whether a fingerprint already exists, and finally has the cloud platform apply an approximate algorithm to near-deduplicate the fingerprints. Experimental results show that NPD remains efficient while substantially improving the deduplication ratio, which in turn reduces network traffic and shortens data upload time; the method can support massive data processing and has good practical value.
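The pipeline outlined in the abstract (SimHash fingerprints, a cuckoo-filter existence check, then approximate matching) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not give the chunking scheme, hash function, or filter parameters, so everything below is an assumption made for illustration: overlapping k-mers as SimHash features, an MD5-derived 64-bit hash, 16-bit filter fingerprints, four slots per bucket, Hamming distance as the similarity measure, and the names `simhash`, `hamming`, and `CuckooFilter`. The paper uses two filters (client and cloud); the sketch shows only one.

```python
import hashlib
import random


def _h64(data: bytes) -> int:
    """Stable 64-bit hash (first 8 bytes of MD5). Assumed; the paper names no hash."""
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def simhash(seq: str, k: int = 8, bits: int = 64) -> int:
    """SimHash of one sequence block, using overlapping k-mers as features (assumed)."""
    counts = [0] * bits
    for i in range(max(len(seq) - k + 1, 1)):
        h = _h64(seq[i:i + k].encode())
        for b in range(bits):
            # Each feature votes +1 or -1 on every bit position.
            counts[b] += 1 if (h >> b) & 1 else -1
    # A bit is set in the fingerprint when its votes are positive overall.
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


class CuckooFilter:
    """Hypothetical minimal cuckoo filter keyed by 64-bit SimHash values."""

    def __init__(self, n_buckets: int = 1024, bucket_size: int = 4, max_kicks: int = 500):
        # A power-of-two bucket count keeps the XOR-derived index in range.
        assert n_buckets > 0 and n_buckets & (n_buckets - 1) == 0
        self.mask = n_buckets - 1
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(n_buckets)]

    def _fp(self, item: int) -> int:
        # 16-bit fingerprint stored in the filter; never zero.
        return (_h64(b"fp" + item.to_bytes(8, "big")) & 0xFFFF) or 1

    def _alt(self, index: int, fp: int) -> int:
        # Partial-key cuckoo hashing: the alternate bucket depends only on
        # the current bucket and the stored fingerprint.
        return (index ^ _h64(fp.to_bytes(2, "big"))) & self.mask

    def _indices(self, item: int, fp: int):
        i1 = _h64(item.to_bytes(8, "big")) & self.mask
        return i1, self._alt(i1, fp)

    def insert(self, item: int) -> bool:
        fp = self._fp(item)
        i1, i2 = self._indices(item, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a random entry and relocate it.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is considered full

    def contains(self, item: int) -> bool:
        fp = self._fp(item)
        i1, i2 = self._indices(item, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]
```

In a setup like this, a client would compute `simhash` for each block and call `contains` before uploading. Blocks whose fingerprint is already present can be skipped, while the remaining ones could be compared by `hamming` distance against stored fingerprints, with a small distance treated as a near-duplicate. The threshold for "small" is not stated in the abstract and would need to be tuned.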

Keywords: high-throughput sequencing; data deduplication; near deduplication; cuckoo filter

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]
