A Chinese Word Segmentation Method Using a Machine Reading Comprehension Model  Cited by: 2

Machine Reading Comprehension Model for Chinese Word Segmentation


Authors: ZHOU Yulin; CHEN Yanping; HUANG Ruizhang; QIN Yongbin [1,2]; LIN Chuan

Affiliations: [1] State Key Laboratory of Public Big Data, Guiyang 550025, China; [2] College of Computer Science & Technology, Guizhou University, Guiyang 550025, China

Source: Journal of Xi'an Jiaotong University, 2022, No. 8, pp. 95-103 (9 pages)

Funding: National Natural Science Foundation of China (62166007).

Abstract: Sequence labeling models for Chinese word segmentation struggle to capture long-distance semantic dependencies within a sentence, so input features are underused, and the scarcity of boundary samples leads to data imbalance. To address these problems, this paper proposes a Chinese word segmentation method based on a machine reading comprehension (MRC) model. First, the sequence labeling task is converted into an MRC task by constructing triples of question information, text content, and word answers, so that the input features of the sentence are used effectively. Then, the triples are encoded with BERT (Bidirectional Encoder Representations from Transformers), a pre-trained model, to capture contextual information, and a binary classifier predicts the word answers. Finally, the original cross-entropy loss function is modified to alleviate the data imbalance. Experiments on four public datasets from the Bakeoff 2005 corpus (PKU, MSRA, CITYU, and AS) show that the proposed method achieves F1 scores of 96.64%, 97.8%, 97.02%, and 96.02%, respectively. Compared with other mainstream neural network sequence labeling models, these are improvements of 0.13%, 0.37%, 0.4%, and 0.08%, respectively.
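The conversion from sequence labeling to an MRC-style input described in the abstract can be illustrated with a minimal sketch. The abstract does not give the question wording or the label layout, so everything below is an assumption: the fixed question string, the `MRCExample` class, and the per-character binary start/end labels are hypothetical choices for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MRCExample:
    question: str                    # hypothetical query text (not from the paper)
    context: str                     # the unsegmented sentence
    answers: List[Tuple[int, int]]   # inclusive (start, end) character spans of words
    start_labels: List[int]          # 1 where a word starts, else 0
    end_labels: List[int]            # 1 where a word ends, else 0


def build_example(words: List[str],
                  question: str = "Find all words in the text.") -> MRCExample:
    """Turn a gold-segmented sentence into a (question, context, answers) triple."""
    words = [w for w in words if w]  # ignore empty tokens
    context = "".join(words)
    start_labels = [0] * len(context)
    end_labels = [0] * len(context)
    answers = []
    pos = 0
    for w in words:
        start, end = pos, pos + len(w) - 1
        answers.append((start, end))
        start_labels[start] = 1      # binary target for a "word starts here" classifier
        end_labels[end] = 1          # binary target for a "word ends here" classifier
        pos += len(w)
    return MRCExample(question, context, answers, start_labels, end_labels)


def decode(context: str, start_labels: List[int], end_labels: List[int]) -> List[str]:
    """Recover words by pairing each start flag with the next end flag."""
    words, start = [], None
    for i, (s, e) in enumerate(zip(start_labels, end_labels)):
        if s == 1:
            start = i
        if e == 1 and start is not None:
            words.append(context[start:i + 1])
            start = None
    return words
```

For example, `build_example(["中文", "分词"])` yields the context `"中文分词"` with answer spans `(0, 1)` and `(2, 3)`, and `decode` applied to its labels returns the original two words. In a full system the binary labels would be predicted from encoder outputs rather than read from gold data.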

Keywords: Chinese word segmentation; sequence labeling; ambiguous words; machine reading comprehension

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
