Named entity recognition of COVID-19 clinical text based on MPNet and BiLSTM  Cited by: 1


Authors: CAI Xiaoqiong; ZHENG Zengliang; SU Qianmin [1]; GUO Jinglei

Affiliations: [1] College of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China; [2] School of Basic Medical Sciences, Shanghai University of Traditional Chinese Medicine, Shanghai 201203, China

Source: Intelligent Computer and Applications (《智能计算机与应用》), 2023, No. 1, pp. 164-170, 177 (8 pages)

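Funding: National Science and Technology Major Project of the 13th Five-Year Plan (2018ZX09711001-009-001); Shanghai 2017 Science and Technology Innovation Action Plan (17401970900).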

Abstract: With the rapid development of biomedical research and information technology, the volume of clinical medical literature is growing exponentially, and automatically extracting medical knowledge with text mining has become a research hotspot. To address the scarcity of research on Corona Virus Disease 2019 (COVID-19) clinical text, the shortage of corpora, and low annotation quality, this paper combines the UMLS medical semantic network with expert definitions to formulate medical entity annotation rules, builds a named entity recognition corpus, and defines the entity recognition task. It then proposes a named entity recognition model for COVID-19 clinical text based on MPNet and BiLSTM. A pre-trained language model produces vectorized representations of the text, which addresses polysemy; a bidirectional long short-term memory network captures long-range dependencies in the text; finally, a conditional random field performs sentence-level sequence labeling and outputs the complete optimal label sequence. Experimental results show that the MPNet-BiLSTM-CRF model performs well on the COVID-19 clinical named entity recognition dataset.
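The final step in the abstract, where a conditional random field outputs the best label sequence for a whole sentence, can be illustrated with a minimal Viterbi decoding sketch. This is not the authors' implementation. It assumes that per-token emission scores are already available (in the described model they would come from the BiLSTM over MPNet representations) and that tag-to-tag transition scores have been learned. The function name `viterbi_decode`, the `SYM` entity type, and all scores below are hypothetical values chosen for illustration.

```python
from typing import Dict, List, Sequence


def viterbi_decode(
    emissions: Sequence[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    tags: Sequence[str],
) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence."""
    if not emissions:
        return []
    # best[t][tag]: score of the best path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t-1][tag]: previous tag on that best path.
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to the first token.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical BIO tag set and scores (assumptions, not from the paper).
tags = ["O", "B-SYM", "I-SYM"]
transitions = {p: {c: 0.0 for c in tags} for p in tags}
transitions["O"]["I-SYM"] = -10.0    # an I- tag should not follow O
transitions["B-SYM"]["I-SYM"] = 1.0
transitions["I-SYM"]["I-SYM"] = 0.5

# Emission scores for the three tokens "patient", "dry", "cough".
emissions = [
    {"O": 2.0, "B-SYM": 0.1, "I-SYM": 0.0},
    {"O": 0.5, "B-SYM": 1.0, "I-SYM": 1.2},
    {"O": 0.2, "B-SYM": 0.3, "I-SYM": 1.5},
]

decoded = viterbi_decode(emissions, transitions, tags)
print(decoded)  # ['O', 'B-SYM', 'I-SYM']
```

Choosing the highest-scoring tag for each token independently would label the second token `I-SYM` directly after `O`, which is not a valid BIO sequence. Scoring the sentence as a whole lets the transition penalty steer the decoder to `B-SYM` instead.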

Keywords: COVID-19; named entity recognition; bidirectional long short-term memory network; conditional random field

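Classification: TP391 [Automation and Computer Technology: Computer Application Technology]; R319 [Automation and Computer Technology: Computer Science and Technology]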

 
