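Affiliations: [1] Institute of Bioinformatics, Department of Automation, School of Information Science and Technology, Tsinghua University; National Laboratory for Information Science and Technology; State Key Laboratory of Intelligent Technology and Systems; Ministry of Education Key Laboratory of Bioinformatics, Beijing 100084 [2] Peking University Health Science Center, Beijing 100083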
Source: Acta Genetica Sinica, 2004, No. 5, pp. 431-443 (13 pages)
Funding: National Natural Science Foundation of China (Grant No. 30270342)
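Abstract: Using a strategy that combines bioinformatics analysis with experimental confirmation, we compared the genes we identified against the non-redundant database and found that the computer-annotated coding sequences of the human genome published online contain numerous errors of various types. These include insertion, deletion, or mutation of one base or a stretch of bases at the cDNA level, or different combinations of these errors. Erroneous insertions are the most common and often cause frameshifts in the encoded amino acid sequence. We are the first to document the following error types in new human genes predicted by the NCBI Genome Annotation Project: (1) erroneous insertion of one base in the open reading frame (ORF), shifting the frame of the encoded amino acids; (2) incorrect splicing; (3) erroneous insertion of one base or a stretch of bases in the ORF that terminates the frame prematurely, leaving an incomplete cDNA sequence that encodes only the N-terminal amino acids; (4) an incomplete cDNA that encodes only the C-terminal amino acid sequence; (5) an incomplete cDNA sequence that is only an internal protein-coding segment of the correct gene's ORF, lacking both the N-terminal and C-terminal amino acid sequences, in which the first amino acid of the incomplete protein, which is not encoded by a start codon, is wrongly predicted as the initiator, for example L mispredicted as M; (6) erroneous insertion of one base or a stretch of bases in the ORF that creates a spurious upstream stop codon, so that the encoded protein lacks its first amino acids; (7) possible treatment of contaminating genomic sequence as a complete gene cDNA, yielding a so-called single-exon gene. Even if it really is a gene, it is only a small ORF within a longer single-exon mRNA, and although an in-frame stop codon does exist upstream of the ORF start codon, no other feature meets the criteria for a gene; (8) the predicted gene has only an ORF, …

We found that computer-annotated coding regions of the human genome in the public domain contain many errors of different kinds, identified through homology BLAST searches of our cloned genes against the non-redundant (nr) database. The errors include insertions, deletions, or mutations of one base pair or a segment at the cDNA level, or different combinations of these. We used three main approaches to validate the model genes in the NCBI Genome Annotation Project RefSeqs and to identify their errors: (1) Evaluating the degree of support from human EST clustering and from BLAST against the draft human genome. (2) Chromosomal mapping of our verified genes and analysis of their genomic organization. All exon/intron boundaries should be consistent with the GT/AG rule, and the consensus sequences surrounding the splice boundaries should be found as well. (3) Experimental verification of the in silico cloned genes by RT-PCR, followed by cDNA sequencing. We then used three further approaches as reference: (1) Web searching or in silico cloning of the genes from other species, especially the mouse and rat homologs, and thus judging the existence of the gene by ontology. (2) Taking as the standard the released public-domain genes that should be highly homologous to our verified genes, especially the released human genes in the NCBI Genome Annotation Project RefSeqs, we tried to clone, following the strategy developed in this paper, a complete gene highly homologous to each released gene. If we cannot obtain it, our verified gene may be correct and the released public-domain gene may be wrong. (3) To find more evidence, we verified our cloned genes by RT-PCR or hybridization. Here we list some of the errors we found in the NCBI Genome Annotation Project RefSeqs: (1) A base is erroneously inserted in the ORF, shifting the frame of the encoded amino acids. In detail, a base in the ORF of a gene is a redundant insertion, which causes a reading frame shift and the translation of an alternative protein, such as LOC124919 is wrong f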
Keywords: human genome; expressed sequence tag (EST); in silico cloning; gene correction; model reference sequence; bioinformatics
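Classification: TP392 [Automation and Computer Technology: Computer Application Technology] Q987 [Automation and Computer Technology: Computer Science and Technology]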
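Illustration: two of the checks named in the abstract, the GT/AG rule for exon/intron boundaries and the frameshift caused by a single inserted base, can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions. The function names, the exon coordinates, and the toy sequences are hypothetical and are not taken from the paper. It assumes forward-strand sequence and ignores the rarer non-GT/AG introns and the wider splice-site consensus.

```python
# Hypothetical helpers and toy sequences; nothing here is data from the paper.

STOP_CODONS = {"TAA", "TAG", "TGA"}


def introns_follow_gt_ag(genomic, exons):
    """Check the GT/AG rule for every intron between consecutive exons.

    `exons` is a sorted list of (start, end) pairs, 0-based and half-open,
    on the forward strand of `genomic`.
    """
    genomic = genomic.upper()
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron = genomic[prev_end:next_start]
        # An intron must be long enough to hold both dinucleotides.
        if len(intron) < 4:
            return False
        if not (intron.startswith("GT") and intron.endswith("AG")):
            return False
    return True


def first_in_frame_stop(cdna, start=0):
    """Return the 0-based index of the first in-frame stop codon, or None."""
    cdna = cdna.upper()
    for i in range(start, len(cdna) - 2, 3):
        if cdna[i:i + 3] in STOP_CODONS:
            return i
    return None


# Toy gene: two exons separated by an intron that begins GT and ends AG.
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
exons = [(0, 6), (18, 24)]
splice_ok = introns_follow_gt_ag(genomic, exons)

# Toy ORF: ATG GCT GAC CTG AAA TAA (stop codon at index 15).
correct_orf = "ATGGCTGACCTGAAATAA"
# One extra base after the start codon shifts the frame: ATG TGC TGA ...
inserted_orf = correct_orf[:3] + "T" + correct_orf[3:]
stop_correct = first_in_frame_stop(correct_orf)
stop_inserted = first_in_frame_stop(inserted_orf)
```

In this toy case the single inserted base moves the first in-frame stop codon from index 15 to index 6, which mirrors the premature termination described for error type (3).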