Analysis on n-gram statistics and linguistic features of whole genome protein sequences  


Authors: Dong Qiwen (董启文), Wang Xiaolong (王晓龙), Lin Lei (林磊)

Affiliation: [1] School of Computer Science and Technology, Harbin Institute of Technology

Source: Journal of Harbin Institute of Technology (New Series), 2008, No. 5, pp. 694-698 (5 pages). Chinese journal title: 哈尔滨工业大学学报(英文版)

Funding: Sponsored by the National Natural Science Foundation of China (Grant No. 60435020)

Abstract: To support statistical analysis of the large number of genomic and proteomic sequences available for different organisms, n-grams were extracted from the whole genome protein sequences of 20 organisms. Their linguistic features were analyzed with two tests developed for natural languages and symbolic sequences: the Zipf power law and Shannon entropy. Natural genome proteins were compared with artificial genome proteins, and several statistical features of n-grams were identified. The results show that: the n-grams of whole genome protein sequences approximately follow the Zipf law when n is larger than 4; the Shannon n-gram entropy of natural genome proteins is lower than that of artificial proteins; a simple unigram model can distinguish different organisms; and there are organism-specific usages of "phrases" in protein sequences. It is suggested that further detailed analysis of the n-grams of whole genome protein sequences will result in a powerful model for mapping the relationship between protein sequence, structure and function.
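The two tests named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`ngram_counts`, `shannon_entropy`, `zipf_slope`) and the toy sequences are hypothetical, and it assumes overlapping n-grams, entropy in bits over the empirical n-gram distribution, and a plain least-squares fit of log frequency against log rank as the Zipf check.

```python
import math
from collections import Counter


def ngram_counts(sequences, n):
    """Count overlapping n-grams across a collection of protein sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the empirical n-gram distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 corresponds to the classic Zipf law.
    Requires at least two distinct n-grams.
    """
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy sequences invented for illustration only (not data from the paper).
toy_proteins = ["MKVLAAGIVK", "MKVLSAGLVK", "MAAGIVKKVL"]
bigrams = ngram_counts(toy_proteins, 2)
print(len(bigrams), shannon_entropy(bigrams), zipf_slope(bigrams))
```

With real whole-genome protein sets, the same functions could be run for increasing n and on shuffled or randomly generated sequences to compare entropies, which is the kind of natural-versus-artificial comparison the abstract describes.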

Keywords: n-gram statistics; protein sequence; Zipf law

Classification number: Q517 [Biology: Biochemistry]

 
