Chinese Medical Named Entity Recognition Based on SoftLexicon and Adversarial Training

Authors: PAN Shipeng; Turdi Tohti; LIANG Yi; Askar Hamdulla [1,2]

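Affiliations: [1] School of Computer Science and Technology, Xinjiang University, Urumqi 830017, Xinjiang, China; [2] Xinjiang Key Laboratory of Multilingual Information Technology, Urumqi 830017, Xinjiang, China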

Source: Journal of Shanxi University (Natural Science Edition), 2024, Issue 2, pp. 260-268 (9 pages)

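Funding: National Natural Science Foundation of China (62166042; U2003207); Natural Science Foundation of Xinjiang Uygur Autonomous Region (2021D01C076); Strengthening Plan of the National Defense Science and Technology Foundation of China (2021-JCJQ-JJ-0059).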

Abstract: Most existing medical entity recognition models fail to fully extract and exploit the lexical information in a text sequence, and their structures are complex. As a result, they suffer from inaccurate entity boundary detection and poor robustness on medical text. In addition, most character-granularity named entity recognition (NER) methods do not adequately address the problem of information omission. To address these problems, this paper proposes an NER model based on character-word fusion and adversarial training. The model first uses the pre-trained model BERT (Bidirectional Encoder Representations from Transformers) to obtain character vectors for the text sequence. It then uses SoftLexicon to introduce lexicon information and adds perturbed samples generated by adversarial training to the character vectors. Finally, BiLSTM-CRF (Bidirectional Long Short-Term Memory with a Conditional Random Field) extracts features and produces the sequence labeling results. On the CCKS2019 and CCKS2020 datasets, the model reaches F1 scores of 85.07% and 90.39%, respectively. Compared with the baseline model, these F1 scores are higher by 2.31% and 2.88%, indicating that combining character-word fusion with adversarial training is effective for recognizing medical entities.
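To make the character-word fusion step concrete, the following is a minimal sketch of how SoftLexicon-style word sets can be built for each character. It follows the commonly described B/M/E/S scheme (words that Begin at, pass through the Middle of, End at, or consist Solely of the character). The function name `soft_lexicon_sets`, the toy lexicon, and the `max_word_len` parameter are illustrative assumptions, not details taken from the paper, and the sketch stops before the weighting and concatenation with BERT character vectors.

```python
from typing import Dict, List, Set


def soft_lexicon_sets(
    sentence: str, lexicon: Set[str], max_word_len: int = 4
) -> List[Dict[str, Set[str]]]:
    """Return, for every character, the lexicon words matched at that position.

    Keys: "B" (word begins here), "M" (character is inside the word),
    "E" (word ends here), "S" (single-character word).
    """
    sets = [{"B": set(), "M": set(), "E": set(), "S": set()} for _ in sentence]
    n = len(sentence)
    for start in range(n):
        # Enumerate every candidate span of up to max_word_len characters.
        for end in range(start + 1, min(n, start + max_word_len) + 1):
            word = sentence[start:end]
            if word not in lexicon:
                continue
            if len(word) == 1:
                sets[start]["S"].add(word)
            else:
                sets[start]["B"].add(word)
                sets[end - 1]["E"].add(word)
                for mid in range(start + 1, end - 1):
                    sets[mid]["M"].add(word)
    return sets


# Hypothetical toy lexicon for a short clinical phrase ("acute appendicitis").
toy_lexicon = {"急性", "阑尾", "阑尾炎", "急性阑尾炎", "炎"}
for char, word_sets in zip("急性阑尾炎", soft_lexicon_sets("急性阑尾炎", toy_lexicon, 5)):
    print(char, word_sets)
```

In a full model, each of the four sets would typically be pooled into a fixed-size vector and concatenated with the character vector before the BiLSTM layer; that pooling is omitted here.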

Keywords: named entity recognition; character-word fusion; adversarial training; PGD
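The keywords name PGD (projected gradient descent) as the adversarial training technique. The sketch below illustrates the inner loop of PGD on a single embedding vector: take several small steps in the direction that increases the loss, and project the accumulated perturbation back into a norm ball after each step. Everything specific here is an assumption made for illustration: the toy logistic loss stands in for the model's actual training loss, the L2 ball is one common choice of constraint, and `epsilon`, `alpha`, and `steps` are hypothetical values that this record does not report.

```python
import math
from typing import List, Tuple


def l2_norm(vector: List[float]) -> float:
    return math.sqrt(sum(x * x for x in vector))


def loss_and_grad(
    embedding: List[float], weights: List[float], label: int
) -> Tuple[float, List[float]]:
    """Toy logistic loss log(1 + exp(-y * w.x)) and its gradient w.r.t. x."""
    margin = label * sum(w * x for w, x in zip(weights, embedding))
    loss = math.log1p(math.exp(-margin))
    coeff = -label / (1.0 + math.exp(margin))  # -y * sigmoid(-margin)
    return loss, [coeff * w for w in weights]


def pgd_perturbation(
    embedding: List[float],
    weights: List[float],
    label: int,
    epsilon: float = 1.0,  # radius of the L2 ball (assumed)
    alpha: float = 0.3,    # step size (assumed)
    steps: int = 3,        # number of ascent steps (assumed)
) -> List[float]:
    delta = [0.0] * len(embedding)
    for _ in range(steps):
        perturbed = [x + d for x, d in zip(embedding, delta)]
        _, grad = loss_and_grad(perturbed, weights, label)
        grad_norm = l2_norm(grad)
        if grad_norm == 0.0:
            break
        # Gradient ascent step on the loss, using the normalized gradient.
        delta = [d + alpha * g / grad_norm for d, g in zip(delta, grad)]
        # Project back onto the L2 ball of radius epsilon.
        delta_norm = l2_norm(delta)
        if delta_norm > epsilon:
            delta = [epsilon * d / delta_norm for d in delta]
    return delta


embedding = [0.5, -0.2, 0.1]
weights = [1.0, 2.0, -1.0]
delta = pgd_perturbation(embedding, weights, label=1)
clean_loss, _ = loss_and_grad(embedding, weights, 1)
adv_loss, _ = loss_and_grad([x + d for x, d in zip(embedding, delta)], weights, 1)
print("clean loss:", clean_loss, "adversarial loss:", adv_loss)
```

In adversarial training, the loss on the perturbed embeddings would also be backpropagated so that the model learns to stay accurate under such perturbations; only the construction of the perturbation is shown here.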

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
