Named entity recognition method for culture and museum data based on pre-training model

Cited by: 1


Authors: ZHAO Zhuo[1]; TIAN Kan[1]; ZHANG Shu[1]; ZHANG Chen; WU Tao[2]; JIANG Feng; YOU Xiaolin (Department of Heritage Information, Chongqing China Three Gorges Museum, Chongqing 400015, China; School of Cyberspace Security and Information, Chongqing University of Posts and Telecommunications, Chongqing 400065, China)

Affiliations: [1] Department of Heritage Information, Chongqing China Three Gorges Museum, Chongqing 400015, China; [2] School of Cyberspace Security and Information, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

Published in: Journal of Computer Applications, 2022, No. S01, pp. 48-53 (6 pages)

Funding: National Natural Science Foundation of China (61802039); Natural Science Foundation of Chongqing (cstc2020jcyj-msxmX0804).

Abstract: When constructing a knowledge graph from culture and museum data, extracting valid triples from text is particularly important, so named entity recognition (NER) is the first task in mining such data. Traditional Chinese NER methods mostly use deep neural network models, which map each word to a single vector and therefore cannot represent word polysemy well. A pre-trained language model can vectorize characters effectively while fully incorporating semantic information. Therefore, a pre-trained entity recognition model based on BERT (Bidirectional Encoder Representations from Transformers) is proposed for culture and museum data: the BERT pre-trained model is used for word embedding, a Bidirectional Long Short-Term Memory (BiLSTM) network incorporates context to enhance the semantic information of the word vectors, and a Conditional Random Field (CRF) model then decodes the tag sequence. Compared with the traditional Long Short-Term Memory (LSTM) network and the BiLSTM-CRF model, the proposed model performs strongly on the public MSRA dataset and on a self-annotated culture and museum knowledge dataset; on the latter, the model reaches an accuracy of 93.57%, a recall of 75.00%, and an F1 score of 73.58%.
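The abstract's final pipeline stage, CRF decoding, chooses the globally best tag sequence rather than labeling each token independently. The paper does not publish code, so the sketch below is an illustrative, self-contained Viterbi decoder over a toy BIO tag set; the tag names, scores, and transition table are invented for the example (in the described model, emission scores would come from the BiLSTM layer and transition scores would be learned by the CRF).

```python
# Illustrative sketch (not the authors' implementation): Viterbi decoding,
# the role the CRF layer plays at the end of the BERT -> BiLSTM -> CRF model.

def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions  : list of dicts, per-token score for each tag
    transitions: dict mapping (prev_tag, tag) -> transition score
    tags       : list of tag names
    """
    # Initialize with the first token's emission scores.
    scores = {t: emissions[0][t] for t in tags}
    backpointers = []

    for emit in emissions[1:]:
        new_scores, bp = {}, {}
        for t in tags:
            # Best previous tag leading into current tag t.
            prev, s = max(
                ((p, scores[p] + transitions.get((p, t), 0.0)) for p in tags),
                key=lambda x: x[1],
            )
            new_scores[t] = s + emit[t]
            bp[t] = prev
        scores = new_scores
        backpointers.append(bp)

    # Trace the best path backwards from the best final tag.
    best = max(scores, key=scores.get)
    path = [best]
    for bp in reversed(backpointers):
        best = bp[best]
        path.append(best)
    return list(reversed(path))


# Toy example: three tokens, hypothetical tags for a museum-artifact entity.
tags = ["B-ART", "I-ART", "O"]
emissions = [
    {"B-ART": 2.0, "I-ART": 0.1, "O": 0.5},
    {"B-ART": 0.1, "I-ART": 1.8, "O": 0.4},
    {"B-ART": 0.2, "I-ART": 0.3, "O": 1.5},
]
# Transition scores penalize illegal moves such as O -> I-ART.
transitions = {("B-ART", "I-ART"): 1.0, ("O", "I-ART"): -5.0}

print(viterbi_decode(emissions, transitions, tags))  # ['B-ART', 'I-ART', 'O']
```

In practice the same dynamic program runs over the full tag set and the BiLSTM's emission matrix; the CRF's learned transitions are what rule out invalid sequences such as an I- tag without a preceding B- tag.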

Keywords: named entity recognition; pre-training; knowledge graph; natural language processing; deep learning

CLC number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
