A Survey on Gene Sequence Compression Algorithms Based on Reference Sequences

Authors: CAI Jiawei; HU Chuan; WANG Huajin; SHEN Zhihong[1,2]

Affiliations: [1] Computer Network Information Center, Chinese Academy of Sciences, Beijing 100083, China; [2] University of Chinese Academy of Sciences, Beijing 100049, China

Source: Frontiers of Data & Computing, 2024, No. 4, pp. 59-76 (18 pages)

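Funding: National Key R&D Program of China project "Basic Software Stack and System for National Scientific Data Centers" (2021YFF0704200); Chinese Academy of Sciences 14th Five-Year Plan informatization special project "Scientific Big Data Engineering (Phase III)" (CAS-WX2022GC-02).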

Abstract: [Background] Over the past two decades, DNA sequencing technologies have continued to advance. The resulting volume of biological sequence data poses serious challenges for data storage, management, and transmission. [Objective] This paper surveys the reference-based gene sequence compression algorithms of the past fifteen years, looking for ways to speed up the sharing of biological data and reduce storage costs. [Methods] Taking the development of the algorithms as its starting point, the paper classifies them by the key techniques they use and by their compression optimization strategies. Experiments verify the performance of current mainstream algorithms and expose the problems of current reference-based compression algorithms. The paper also proposes several research directions worth exploring and gives an outlook on future research. [Results] The paper analyzes the techniques used by existing reference-based gene sequence compression algorithms, including approaches based on single nucleotide polymorphisms, detection of maximal exact matches, segment/block processing, and LZ77. Several well-known algorithms were reproduced; they tend to achieve high compression ratios on benchmark datasets but generally low compression ratios on ordinary datasets. [Conclusions] In theory, existing reference-based gene sequence compression algorithms can speed up data transmission and save storage costs, but their practicality remains questionable. Common subsequence matching needs further improvement to better support ordinary datasets, and a preprocessing step for reference sequences should be added to reduce matching time overhead.
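The general idea behind the reference-based techniques named in the abstract (exact-match detection against a reference, with the unmatched remainder stored literally) can be illustrated with a minimal Python sketch. This is not the method of any particular surveyed algorithm: the function names `build_index`, `compress`, and `decompress`, the k-mer length `k`, and the `('M', pos, length)` / `('L', text)` operation format are all assumptions made for illustration, and a real tool would additionally entropy-code the operations.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def compress(target, reference, k=4):
    """Greedily encode target as ('M', ref_pos, length) matches
    and ('L', text) literals (hypothetical operation format)."""
    index = build_index(reference, k)
    ops, literal, i = [], [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position sharing the k-mer seed at target[i:].
        for pos in index.get(target[i:i + k], ()):
            length = k
            # Extend the exact match to the right as far as possible.
            while (i + length < len(target)
                   and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            if literal:  # flush pending unmatched bases first
                ops.append(("L", "".join(literal)))
                literal = []
            ops.append(("M", best_pos, best_len))
            i += best_len
        else:
            literal.append(target[i])
            i += 1
    if literal:
        ops.append(("L", "".join(literal)))
    return ops


def decompress(ops, reference):
    """Rebuild the target from the operations and the same reference."""
    parts = []
    for op in ops:
        if op[0] == "M":
            _, pos, length = op
            parts.append(reference[pos:pos + length])
        else:
            parts.append(op[1])
    return "".join(parts)


reference = "ACGTTAGCCGATAGGCTAAC"
target = "ACGTTAGCCGTTAGGCTAAC"  # one substitution relative to the reference
ops = compress(target, reference)
assert decompress(ops, reference) == target
```

When the target closely resembles the reference, the operation list is far shorter than the sequence itself. When it does not, most of the target falls back to literals, which is consistent with the abstract's observation that compression ratios drop on ordinary datasets.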

Keywords: reference sequence; gene compression; DNA sequence

Classification: Q811.4 [Biology: Bioengineering]; TP18 [Automation and Computer Technology: Control Theory and Control Engineering]
