DNAVec: Pre-Trained Word Vector Representation of Genomic DNA Sequences

Source: Hans Journal of Biomedicine (《生物医学》), 2021, No. 3, pp. 121-128 (8 pages)

Abstract: Deciphering the information encoded in DNA sequences is one of the fundamental problems of genomic research. The gene regulatory code is highly complex because of polysemous relationships, and earlier bioinformatics methods often fail to capture the implicit information in DNA sequences, especially when data are scarce. Predicting the structure and function of DNA sequences from sequence information is therefore an important challenge in computational biology. To address this challenge, we introduce a new approach that represents DNA sequences as continuous word vectors using BERT, a language model from the field of natural language processing. By modelling DNA sequences, BERT effectively captures sequence properties from large-scale unlabelled data. We call this new embedding representation DNAVec (DNA-to-Vector). In addition, pre-trained word vectors can be extracted from the model to represent DNA sequences in other sequence-level classification tasks.

Keywords: BERT; DNA sequence; pre-training; natural language processing

Classification code: G63 [Cultural Sciences: Education]
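
The abstract describes treating DNA as text for a BERT-style language model and then reusing the learned vectors for sequence-level tasks. The sketch below illustrates one way such a pipeline is commonly set up; it is not taken from the paper. It assumes overlapping k-mer tokenization (the abstract does not say how sequences are split into "words"), a 15% masking rate as in standard BERT pre-training, and mean pooling to get one vector per sequence. The function names `kmer_tokens`, `mask_tokens`, and `mean_pool`, and the toy embedding table, are hypothetical.

```python
import random


def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mers with stride 1.

    Assumption: k-mer "words" are a common choice for DNA language
    models; the abstract does not specify the tokenization used.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens for masked-language-model training.

    Returns the masked token list and a parallel list of labels, where a
    label is the original token at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def mean_pool(tokens, table):
    """Average per-token vectors into one sequence-level vector.

    `table` maps a token to a list of floats. In practice it would come
    from a pre-trained model; here it is a hypothetical lookup table.
    Tokens missing from the table are skipped.
    """
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:
        raise ValueError("no token in the sequence has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


if __name__ == "__main__":
    tokens = kmer_tokens("ATGCGTAC", k=3)
    masked, labels = mask_tokens(tokens, mask_rate=0.3, seed=1)
    print(tokens)
    print(masked)

    # Toy 2-dimensional embeddings, for illustration only.
    toy_table = {"ATG": [1.0, 0.0], "TGC": [3.0, 2.0]}
    print(mean_pool(["ATG", "TGC"], toy_table))
```

A pooled vector of this kind could then be passed to any ordinary classifier, which is the sort of downstream use the abstract mentions for sequence-level classification.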
