Authors: ZUO Min (左敏) [1,2]; WANG Hong (王虹); YAN Wenjing (颜文婧); ZHANG Qingchuan (张青川)
Affiliations: [1] National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, Beijing 100048, China; [2] School of E-Business and Logistics, Beijing Technology and Business University, Beijing 100048, China
Source: Journal of Computer Applications (《计算机应用》), 2023, No. 10, pp. 3309-3314 (6 pages)
Funding: National Natural Science Foundation of China (61873027).
Abstract: With the development of high-throughput sequencing technology, massive genome sequence data provide a basis for understanding the structure of the genome. As an essential part of genomics research, splice site identification plays a vital role in gene discovery and in determining gene structure, and it helps in understanding the expression of gene traits. To address the problem that existing models cannot sufficiently extract high-dimensional features of DeoxyriboNucleic Acid (DNA) sequences, a splice site prediction model named BERT-splice was constructed, consisting of BERT (Bidirectional Encoder Representations from Transformers) and a parallel Convolutional Neural Network (CNN). Firstly, a DNA language model was trained with the BERT pre-training method to extract the contextual dynamic association features of DNA sequences and to map DNA sequence features into a high-dimensional matrix. Then, the DNA language model was used to map the human reference genome sequence hg19 data into a high-dimensional matrix, and the result was used as the input of the parallel CNN classifier for retraining. Finally, the splice site prediction model was constructed on this basis. Experimental results show that the prediction accuracy of the BERT-splice model is 96.55% on the donor set of DNA splice sites and 95.80% on the acceptor set, improvements of 1.55% and 1.72% respectively over BERT-RCNN, a prediction model built from BERT and a Recurrent Convolutional Neural Network (RCNN). Meanwhile, the average False Positive Rate (FPR) for donor/acceptor splice sites, tested on five complete human gene sequences, is 4.74%. These results verify the effectiveness of the BERT-splice model for gene splice site prediction.
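Keywords: splice site identification; BERT; Convolutional Neural Network (CNN); deep learning; DeoxyriboNucleic Acid (DNA)
Classification number: TP399 [Automation and Computer Technology: Computer Application Technology]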
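
The abstract reports two kinds of metrics: prediction accuracy on the donor and acceptor sets, and a false positive rate (FPR) on complete gene sequences. As a reading aid, the following is a minimal sketch of how these two quantities are conventionally computed for a binary classifier, using the standard definitions accuracy = (TP + TN) / total and FPR = FP / (FP + TN). The function names and the toy labels are illustrative assumptions; they are not taken from the paper, and the record does not state the exact evaluation protocol the authors used.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels: 1 = splice site, 0 = non-site."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of positions classified correctly."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def false_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true non-sites that were predicted as splice sites."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    negatives = fp + tn
    return fp / negatives if negatives else 0.0


# Hypothetical toy labels for illustration only (not data from the paper).
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
print(f"FPR      = {false_positive_rate(y_true, y_pred):.2f}")
```

On a complete gene sequence almost every position is a non-site, so FPR is the natural measure there: it counts how many of those non-sites are wrongly flagged, which accuracy alone would hide.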