Gene splice site identification based on BERT and CNN


Authors: ZUO Min [1,2]; WANG Hong; YAN Wenjing; ZHANG Qingchuan

Affiliations: [1] National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, Beijing 100048, China; [2] School of E-Business and Logistics, Beijing Technology and Business University, Beijing 100048, China

Source: Journal of Computer Applications (《计算机应用》), 2023, No. 10, pp. 3309-3314 (6 pages)

Funding: National Natural Science Foundation of China (61873027).

Abstract: With the development of high-throughput sequencing technology, massive genome sequence data provide a basis for understanding genome structure. Splice site identification is an essential part of genomics research: it plays a vital role in gene discovery and in determining gene structure, and it helps in understanding the expression of gene traits. To address the insufficient ability of existing models to extract high-dimensional features from deoxyribonucleic acid (DNA) sequences, a splice site prediction model named BERT-splice was constructed by combining BERT (Bidirectional Encoder Representations from Transformers) with parallel Convolutional Neural Networks (CNNs). First, a DNA language model was trained with the BERT pre-training method to extract contextual, dynamic association features of DNA sequences and to map those features into a high-dimensional matrix. Second, human reference genome hg19 sequence data were mapped into high-dimensional matrices by the DNA language model and used as input for retraining the parallel CNN classifier. Finally, the splice site prediction model was built on this basis. Experimental results show that BERT-splice achieves a prediction accuracy of 96.55% on the donor set of DNA splice sites and 95.80% on the acceptor set, improvements of 1.55% and 1.72%, respectively, over BERT-RCNN, a prediction model built from BERT and a Recurrent Convolutional Neural Network (RCNN). In addition, on five complete human gene sequences, the model's average False Positive Rate (FPR) for donor/acceptor splice sites is 4.74%. These results verify the effectiveness of BERT-splice for gene splice site prediction.
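The abstract does not say how DNA sequences are turned into tokens for the language model, nor how the reported metrics were computed. The following is a minimal sketch under stated assumptions, not the authors' implementation: `kmer_tokens` illustrates overlapping k-mer tokenization, a common choice for DNA language models that is assumed here rather than taken from the paper, and `accuracy_and_fpr` applies the standard definitions accuracy = (TP + TN) / N and FPR = FP / (FP + TN). Both function names, the value `k = 3`, and the toy labels are hypothetical.

```python
from typing import List, Sequence, Tuple


def kmer_tokens(seq: str, k: int = 3) -> List[str]:
    """Split a DNA string into overlapping k-mers.

    Assumption: k-mer tokenization is illustrative only; the abstract
    does not state which tokenization BERT-splice uses.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = seq.upper()
    # A sequence of length n yields n - k + 1 overlapping tokens.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def accuracy_and_fpr(y_true: Sequence[int],
                     y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, false positive rate) for binary labels.

    Label 1 marks a true splice site, label 0 a non-site.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # FPR: fraction of true non-sites that were predicted as sites.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr


# Toy data (hypothetical, not from the paper).
tokens = kmer_tokens("AGGTAAGT", k=3)
acc, fpr = accuracy_and_fpr([1, 0, 0, 1, 0, 0], [1, 0, 1, 1, 0, 0])
print(tokens)
print(acc, fpr)
```

In a whole-gene scan, true non-sites vastly outnumber true splice sites, so the false positive rate is reported alongside accuracy: it shows how many non-sites are wrongly flagged.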

Keywords: splice site identification; BERT; Convolutional Neural Network (CNN); deep learning; deoxyribonucleic acid (DNA)

Classification number: TP399 [Automation and Computer Technology: Computer Application Technology]

 
