Authors: Shuo Liu (刘硕), Zhi Zeng (曾志), Fancai Zeng (曾凡才), Mengze Du (杜萌泽)
Affiliations: [1] School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China; [2] Department of Biochemistry and Molecular Biology, School of Basic Medicine, Southwest Medical University, Luzhou 646000, China
Source: Hereditas (Beijing) (《遗传》), 2020, No. 7, pp. 691-702, I0003 (13 pages in total)
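Funding: Supported by the Science Strength Promotion Program of the University of Electronic Science and Technology of China (No. Y0301902610100202).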
Abstract: Advances in sequencing technology have generated massive amounts of genome sequencing data and greatly enriched public genetic data resources. To cope with these data, algorithms and tools for genome comparison and annotation are continually updated, making it possible to combine several annotation tools and obtain more accurate annotations of protein-coding genes. Some prokaryotic genomes in public databases were sequenced and assembled more than a decade ago and contain many predicted coding genes of unknown function. To improve the annotation quality of genomes in the National Center for Biotechnology Information (NCBI) database, we re-annotated 1587 bacterial and archaeal genomes using multiple prokaryotic gene recognition algorithms and software together with gene expression data. First, the 33 Z-curve variables were applied to the original annotations, and 3092 sequences from 177 genomes were recognized as over-annotated as protein-coding genes. Next, 4447 protein-coding genes of unknown function from 939 genomes were assigned specific functions by homology (similarity) search. Finally, missed genes were sought by combining three gene-finding programs, ZCURVE 3.0, Glimmer 3.02 and Prodigal, which are accurate, widely used, and complementary because they are based on different algorithms. In total, 2003 missed protein-coding genes were found in nine genomes; these genes belong to known clusters of orthologous groups of proteins (COG). This study re-annotates early-sequenced bacterial and archaeal genomes with new tools combined with multi-omics data. It should provide a reference for annotating newly sequenced strains, and the gene sequences obtained after re-annotation should benefit subsequent basic research.
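The abstract refers to "the 33 Z-curve variables" without defining them. As an illustration only, the Python sketch below computes one commonly described formulation: 9 variables from base frequencies at the three codon positions, plus 24 variables from dinucleotide frequencies at codon positions 1-2 and 2-3. This decomposition, the normalization by codon count, and the function names `zcurve_33` and `_xyz` are assumptions made for this example. They are not taken from this record and are not the ZCURVE 3.0 implementation.

```python
from collections import Counter

BASES = "ACGT"


def _xyz(freq):
    """Map four base frequencies (dict keyed by A, C, G, T) to Z-curve components."""
    a, c, g, t = (freq.get(b, 0.0) for b in BASES)
    x = (a + g) - (c + t)  # purine vs. pyrimidine
    y = (a + c) - (g + t)  # amino vs. keto
    z = (a + t) - (g + c)  # weak vs. strong hydrogen bonding
    return [x, y, z]


def zcurve_33(orf):
    """Return 33 Z-curve variables for an in-frame ORF (assumed formulation)."""
    orf = orf.upper()
    if len(orf) < 3 or len(orf) % 3 != 0:
        raise ValueError("ORF length must be a positive multiple of 3")
    if set(orf) - set(BASES):
        raise ValueError("ORF must contain only A, C, G, T")

    n_codons = len(orf) // 3
    features = []

    # 9 variables: base frequencies at codon positions 1, 2 and 3.
    for phase in range(3):
        counts = Counter(orf[phase::3])
        freq = {b: counts[b] / n_codons for b in BASES}
        features.extend(_xyz(freq))

    # 24 variables: dinucleotides at codon positions 1-2 and 2-3,
    # grouped by the first base of the pair (4 groups x 3 components x 2).
    for phase in (0, 1):
        pairs = Counter(
            orf[i + phase:i + phase + 2] for i in range(0, len(orf), 3)
        )
        for first in BASES:
            freq = {b: pairs[first + b] / n_codons for b in BASES}
            features.extend(_xyz(freq))

    return features


if __name__ == "__main__":
    vector = zcurve_33("ATGAAATAA")
    print(len(vector))
```

A vector like this could be passed to any classifier that separates coding from non-coding open reading frames. The record does not state which classifier or decision threshold the authors used.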