A Chinese Named Entity Recognition Method with Low Lexical Information Loss Based on Word-Character Fusion (Cited by: 1)

Word-Character Model with Low Lexical Information Loss for Chinese NER

Authors: GUO Zhiqiang, GUAN Donghai, YUAN Weiwei [1] (School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)

Affiliation: [1] School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China

Source: Computer Science (《计算机科学》), 2024, No. 8, pp. 272-280 (9 pages)

Funding: Aeronautical Science Foundation (ASFC-20200055052005).

Abstract: Chinese named entity recognition (CNER) is a natural language processing task that aims to identify entities of specific categories in text, such as names of people, places, and organizations. It is a foundational task for natural language applications such as question answering, machine translation, and information extraction. Because Chinese, unlike English, has no natural word boundaries, word-based NER models lose considerable accuracy on Chinese text when segmentation errors occur, while character-based NER models ignore lexical information. In recent years, many studies have therefore tried to incorporate lexical information into character-based models. WC-LSTM achieved a significant performance gain by injecting lexical information into the start and end characters of each word. However, that model still does not make full use of lexical information. Building on it, this paper proposes LLL-WCM (word-character model with low lexical information loss), which also incorporates lexical information into all intermediate characters of a word to avoid lexical information loss. In addition, two encoding strategies, averaging (avg) and self-attention, are introduced to extract all of the lexical information. Experiments on four Chinese datasets show that, compared with WC-LSTM, the F1 score of this method improves by 1.89%, 0.29%, 1.10%, and 1.54%, respectively.

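Keywords: named entity recognition; natural language processing; lexical information loss; intermediate characters; encoding strategy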

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
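
The abstract describes the core idea only at a high level, so the following is a minimal sketch under stated assumptions, not the authors' implementation. It shows how matched lexicon words might be attached either to the start and end characters only (the WC-LSTM-style assignment described in the abstract) or to every character a word covers, and how a variable number of word vectors per character could be reduced to one vector by averaging or by a simple self-attention. All function names, the toy lexicon, the 2-dimensional word vectors, and the projection-free attention with mean pooling are hypothetical choices made for illustration.

```python
import math
from typing import List, Set, Tuple

Vector = List[float]


def match_words(chars: List[str], lexicon: Set[str], max_len: int = 4) -> List[Tuple[int, int, str]]:
    """Return (start, end_inclusive, word) for every multi-character lexicon word found."""
    spans = []
    n = len(chars)
    for i in range(n):
        for j in range(i + 2, min(n, i + max_len) + 1):
            word = "".join(chars[i:j])
            if word in lexicon:
                spans.append((i, j - 1, word))
    return spans


def words_per_char(n: int, spans: List[Tuple[int, int, str]], include_middle: bool) -> List[List[str]]:
    """Attach each matched word to its start/end characters, or to every character it covers."""
    buckets: List[List[str]] = [[] for _ in range(n)]
    for start, end, word in spans:
        positions = range(start, end + 1) if include_middle else (start, end)
        for p in positions:
            buckets[p].append(word)
    return buckets


def avg_encode(vectors: List[Vector], dim: int) -> Vector:
    """Average the word vectors; return a zero vector when no word matched."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def self_attention_encode(vectors: List[Vector], dim: int) -> Vector:
    """Scaled dot-product self-attention without learned projections (an assumption),
    followed by mean pooling over the attended outputs."""
    if not vectors:
        return [0.0] * dim
    scale = math.sqrt(dim)
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in vectors]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)])
    return avg_encode(outputs, dim)


# Toy example: the sentence and lexicon are illustrative, not taken from the paper.
chars = list("南京市长江大桥")
lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
spans = match_words(chars, lexicon)

start_end_only = words_per_char(len(chars), spans, include_middle=False)
all_chars = words_per_char(len(chars), spans, include_middle=True)

# Hypothetical 2-d word vectors, only to exercise the two encoders.
toy_vectors = {w: [float(len(w)), float(i)] for i, w in enumerate(sorted(lexicon))}
vecs = [toy_vectors[w] for w in all_chars[4]]
print(start_end_only[4], all_chars[4])
print(avg_encode(vecs, 2), self_attention_encode(vecs, 2))
```

In this toy sentence, the character "江" (index 4) receives only "长江" under the start/end assignment, but both "长江" and "长江大桥" when intermediate characters are included, which is the kind of lost lexical information the abstract refers to. A trained model would use learned embeddings and, most likely, learned attention parameters in place of the fixed vectors used here.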
