Construction and Research of a Chinese Word Segmentation Corpus of Process Specification Text  Cited by: 3

Authors: WANG Peiyan[1]; ZHANG Yingxin; FU Xiaoqiang; CHEN Jiaxin; XU Nan[1]; CAI Dongfeng[1]

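Affiliations: [1] Human-Computer Intelligence Research Center, Shenyang Aerospace University, Shenyang 110136, China; [2] Aviation Manufacturing Technology Research Institute, COMAC Shanghai Aircraft Manufacturing, Shanghai 201324, China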

Source: Computer Science, 2023, Issue S02, pp. 63-68 (6 pages)

Funding: Applied Basic Research Program of Liaoning Province (2022JH2/101300248).

Abstract: Chinese word segmentation is a basic task in processing process specification text, and it plays an important role in downstream tasks such as process knowledge graphs and intelligent question answering. One challenge for word segmentation of process specification text is the lack of high-quality annotated corpora, and in particular of segmentation specifications that cover special language phenomena such as terms, noun phrases, process parameters, and quantifiers. This paper formulates a dedicated word segmentation specification for process specification text, and collects and annotates a word segmentation corpus for Chinese process specification text (WS-MPST) containing 11,900 sentences and 255,160 words; the segmentation annotation consistency among 4 annotators reaches 95.25%. The widely used BiLSTM-CRF and BERT-CRF models are compared on the WS-MPST corpus, and their F1 values reach 92.61% and 93.69% respectively. The experimental results show that it is necessary to construct a dedicated word segmentation corpus for process specification text. An in-depth analysis of the results reveals that out-of-vocabulary words and words composed of a mix of Chinese and non-Chinese characters are the difficult cases in segmenting process specification text, which provides some guidance for future word segmentation research on process specification text and related fields.
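The F1 values quoted in the abstract can be made concrete with a minimal sketch of the word-level precision, recall, and F1 commonly used to score Chinese word segmentation. This record does not describe the paper's evaluation script, so the span-matching definition below is an assumption, and the function names and the example sentence are hypothetical illustrations rather than material from WS-MPST.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_sents, pred_sents):
    """Word-level precision, recall and F1 over parallel lists of segmented sentences.

    A predicted word counts as correct only if both of its boundaries
    match a gold word (assumed metric, not confirmed by the source).
    """
    correct = gold_total = pred_total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        if "".join(gold) != "".join(pred):
            raise ValueError("gold and predicted sentences differ in their characters")
        gold_spans, pred_spans = word_spans(gold), word_spans(pred)
        correct += len(gold_spans & pred_spans)
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical example: a mixed Chinese / non-Chinese token ("4mm") split in two.
gold = [["铆钉", "直径", "为", "4mm"]]
pred = [["铆钉", "直径", "为", "4", "mm"]]
precision, recall, f1 = segmentation_prf(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

In this made-up sentence, splitting the single parameter token lowers both precision and recall, which is the kind of error the abstract attributes to words that mix Chinese and non-Chinese characters.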

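Keywords: Chinese word segmentation; process specification text; word segmentation specification; word segmentation corpus; word segmentation model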

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
