Comprehensive re-annotation of protein-coding genes for prokaryotic genomes by Z-curve and similarity-based methods

Authors: Shuo Liu; Zhi Zeng; Fancai Zeng; Mengze Du (School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China; Department of Biochemistry and Molecular Biology, School of Basic Medicine, Southwest Medical University, Luzhou 646000, China)

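Affiliations: [1] School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731; [2] Department of Biochemistry and Molecular Biology, School of Basic Medicine, Southwest Medical University, Luzhou 646000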

Source: Hereditas (Beijing), 2020, No. 7, pp. 691-702, I0003 (13 pages in total)

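Funding: Supported by the Science Strength Promotion Program of the University of Electronic Science and Technology of China (No. Y0301902610100202).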

Abstract: Advances in sequencing technology have generated massive amounts of genome sequencing data and greatly enriched public genetic resources. To cope with these data, algorithms and tools for genome comparison and annotation are continually updated, making it possible to obtain more accurate annotation of protein-coding genes by combining multiple annotation tools. Some prokaryotic genomes in public databases were sequenced and assembled more than a decade ago, and they contain many predicted protein-coding genes of unknown function. To improve the annotation quality of genomes in the National Center for Biotechnology Information (NCBI) database, we re-annotated 1587 bacterial and archaeal genomes using multiple prokaryotic gene recognition algorithms and software tools together with gene expression data. First, the 33 Z-curve variables were used to identify 3092 sequences, from the original annotations of 177 genomes, that had been over-annotated as protein-coding genes. Second, 4447 protein-coding genes of unknown function from 939 genomes were assigned specific functions by homology search. Finally, three gene-finding programs (ZCURVE 3.0, Glimmer 3.02 and Prodigal), which are accurate, widely used, and based on different and complementary algorithms, were used jointly to search for missed genes. In total, 2003 missed protein-coding genes were found in nine genomes; these genes belong to multiple clusters of orthologous groups of proteins (COG). This study re-annotates early-sequenced bacterial and archaeal genomes with new tools combined with multi-omics data. It provides a reference for annotating newly sequenced strains, and the gene sequences obtained after re-annotation should also benefit subsequent basic research.
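The abstract refers to "the 33 Z-curve variables" without defining them. As a rough illustration only, the sketch below computes a 33-dimensional feature vector for one ORF, assuming the commonly described layout: 9 variables from mononucleotide frequencies at the three codon positions, plus 24 variables from dinucleotide frequencies at codon positions 1-2 and 2-3. The function name `zcurve33`, the helper `_xyz`, and the normalization by codon count are assumptions made for this example; they are not taken from the paper or from the ZCURVE software.

```python
from collections import Counter

BASES = "ACGT"


def _xyz(p):
    """Map frequencies of A, C, G, T (a dict) to the three Z-curve components."""
    a, c, g, t = (p.get(b, 0.0) for b in BASES)
    return [
        (a + g) - (c + t),  # x: purine vs. pyrimidine
        (a + c) - (g + t),  # y: amino vs. keto
        (a + t) - (g + c),  # z: weak vs. strong hydrogen bonding
    ]


def zcurve33(orf):
    """Return a 33-element Z-curve feature list for an in-frame ORF (assumed layout)."""
    orf = orf.upper()
    if len(orf) < 3 or len(orf) % 3 != 0 or set(orf) - set(BASES):
        raise ValueError("ORF must be non-empty, in frame, and contain only A/C/G/T")
    n_codons = len(orf) // 3
    features = []

    # 9 variables: base frequencies at codon positions 1, 2 and 3.
    for phase in range(3):
        counts = Counter(orf[phase::3])
        freqs = {b: counts[b] / n_codons for b in BASES}
        features.extend(_xyz(freqs))

    # 24 variables: dinucleotides at codon positions 1-2 (start=0) and 2-3 (start=1),
    # grouped by the first base of the dinucleotide.
    for start in (0, 1):
        counts = Counter(
            orf[i + start:i + start + 2] for i in range(0, len(orf), 3)
        )
        for first in BASES:
            freqs = {b: counts[first + b] / n_codons for b in BASES}
            features.extend(_xyz(freqs))

    return features
```

Calling `zcurve33("ATGAAATAA")` returns a list of 33 floats. In a re-annotation workflow such vectors would typically be passed to a classifier trained on known coding and non-coding sequences; that step is not shown here because the abstract does not describe it.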

Keywords: bacteria; re-annotation; Z-curve; hypothetical ORFs; non-protein-coding ORFs

Classification: Q811.4 [Biology - Bioengineering]
