Research on Feature Extraction Scheme of Chinese-character Granularity in Sequence Labeling Model: A Case Study About Clinical Named Entity Recognition of CCKS2017:Task2

Cited by: 9

Authors: Sun An; Yu Yingxiang[1]; Luo Yonggang[1,3]; Wang Qi

Affiliations: [1] Information and Archival Department, Shanghai University, Shanghai 200444; [2] Library, Henan University of Science and Technology, Luoyang 471023; [3] College of Medical Instrument, Shanghai University of Medicine & Health Sciences, Shanghai 201318; [4] Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237

Source: Library and Information Service (《图书情报工作》), 2018, No. 11, pp. 103-111 (9 pages)

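Funding: One of the research outputs of the National Social Science Fund of China general project "Construction and Empirical Study of a 'Region-Nation' Integrated Model for Electronic Records Management" (Project No. 11BTQ039)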

Abstract: [Purpose/significance] Based on the characteristics of Chinese language expression, this paper proposes a character-granularity method for extracting word features that incorporates word segmentation tags. The method effectively improves the F1 value of Chinese clinical named entity recognition (CNER) and can serve as a reference for other Chinese sequence labeling models. [Method/process] Three word-level features of Chinese words are selected: part-of-speech tagging, keyword weight, and dependency parsing. They are used to construct the clinical record training text for a character-granularity sequence labeling model, with the corpus taken from CCKS2017:Task2. Under different feature combinations, a conditional random field (CRF) algorithm is used to verify two character-granularity word feature extraction schemes, Method 1 and Method 2. [Result/conclusion] Under all four word feature combinations, Method 2 outperforms Method 1 on the CNER task, and the F1 value increases by 0.23% on average in 4-fold cross-validation. The experiments show that, now that Chinese word segmentation technology is increasingly mature, Method 2 obtains better word feature representations than Method 1 and improves the performance of Chinese character-granularity sequence labeling models.
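The abstract does not spell out how the two schemes differ, so the following is only an illustrative sketch under a stated assumption: that a scheme without segmentation tags copies each word-level feature (here a part-of-speech tag) onto every character of the word, while a scheme with segmentation tags also prefixes the character's position in the word (B/M/E/S). The function names, the tag format, and the sample sentence are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Word = Tuple[str, str]  # (word, word-level feature such as a POS tag)


def position_tags(word: str) -> List[str]:
    """Return one B/M/E/S segmentation tag per character of a word."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]


def char_features_plain(words: List[Word]) -> List[Tuple[str, str]]:
    """Assumed baseline: copy the word feature onto each character."""
    return [(ch, feat) for word, feat in words for ch in word]


def char_features_with_seg(words: List[Word]) -> List[Tuple[str, str]]:
    """Assumed variant: prefix each copied feature with a segmentation tag."""
    rows = []
    for word, feat in words:
        for ch, tag in zip(word, position_tags(word)):
            rows.append((ch, f"{tag}-{feat}"))
    return rows


# Hypothetical segmented and POS-tagged input: two adjacent nouns, then a verb.
sentence = [("患者", "n"), ("腹部", "n"), ("痛", "v")]

for ch, feat in char_features_with_seg(sentence):
    print(ch, feat)
```

With the plain projection, the four characters of the two adjacent nouns all carry the same feature `n`, so the word boundary between them is lost at character level. The tagged variant keeps that boundary visible to a character-level CRF, which is one plausible reason a segmentation-aware scheme could yield better word feature representations.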

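Keywords: named entity recognition; character granularity; feature extraction; sequence labeling model; conditional random field; clinical records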

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

