A Chinese Word Segmentation Method Using a Machine Reading Comprehension Model (cited by: 2)

Machine Reading Comprehension Model for Chinese Word Segmentation

Authors: ZHOU Yulin; CHEN Yanping; HUANG Ruizhang; QIN Yongbin [1,2]; LIN Chuan (State Key Laboratory of Public Big Data, Guiyang 550025, China; College of Computer Science & Technology, Guizhou University, Guiyang 550025, China)

Affiliations: [1] State Key Laboratory of Public Big Data, Guiyang 550025; [2] College of Computer Science & Technology, Guizhou University, Guiyang 550025

Source: Journal of Xi'an Jiaotong University, 2022, No. 8, pp. 95-103 (9 pages)

Funding: National Natural Science Foundation of China (62166007).

Abstract: Sequence labeling models for Chinese word segmentation struggle to capture long-distance semantic dependencies within a sentence, which leaves the input features underused, and they see few boundary samples, which causes data imbalance. To address these problems, this paper proposes a Chinese word segmentation method based on a machine reading comprehension (MRC) model. First, the sequence labeling task is converted into an MRC task by constructing triples of question information, text content and word answers, so that the input features of the sentence are used effectively. Second, the triple information is passed through the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to capture contextual information, and a binary classifier predicts the word answers. Finally, the original cross-entropy loss function is modified to alleviate the data imbalance. Experiments on four public datasets of the Bakeoff 2005 corpus (PKU, MSRA, CITYU and AS) show that the proposed method achieves F1 values of 96.64%, 97.8%, 97.02% and 96.02%, respectively. Compared with other mainstream neural network sequence labeling models, these are improvements of 0.13%, 0.37%, 0.4% and 0.08%, respectively.
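The conversion from sequence labeling to a reading comprehension triple, and the word-level F1 used for evaluation, can be illustrated with a small sketch. This is not the authors' implementation: the question wording (`QUESTION`), the per-character start/end label layout, the decoding rule and all function names are assumptions made for illustration, since the abstract does not specify them. The BERT encoder and the modified loss are left out entirely.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical query wording; the abstract does not give the actual question text.
QUESTION = "Find all words in the text."


def build_mrc_example(words: List[str]) -> Dict[str, object]:
    """Turn a gold-segmented sentence into a (question, context, answers) triple
    with per-character binary start/end labels (an assumed label layout)."""
    words = [w for w in words if w]  # ignore empty tokens
    context = "".join(words)
    start_labels = [0] * len(context)
    end_labels = [0] * len(context)
    pos = 0
    for word in words:
        start_labels[pos] = 1                  # first character of the word
        end_labels[pos + len(word) - 1] = 1    # last character of the word
        pos += len(word)
    return {
        "question": QUESTION,
        "context": context,
        "answers": words,
        "start_labels": start_labels,
        "end_labels": end_labels,
    }


def decode_words(context: str, start_labels: List[int], end_labels: List[int]) -> List[str]:
    """Cut the context after every predicted end and before every predicted start,
    so that every character belongs to exactly one word."""
    words, start = [], 0
    n = len(context)
    for i in range(n):
        next_is_start = i + 1 < n and start_labels[i + 1] == 1
        if end_labels[i] == 1 or next_is_start:
            words.append(context[start:i + 1])
            start = i + 1
    if start < n:
        words.append(context[start:])
    return words


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Represent a segmentation as a set of (start, end) character offsets."""
    spans, pos = set(), 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def word_f1(gold: List[str], pred: List[str]) -> float:
    """Word-level F1: a predicted word counts only if both boundaries match."""
    gold_spans, pred_spans = to_spans(gold), to_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


gold = ["我", "喜欢", "自然", "语言", "处理"]
example = build_mrc_example(gold)
decoded = decode_words(example["context"], example["start_labels"], example["end_labels"])
f1 = word_f1(gold, ["我", "喜欢", "自然语言", "处理"])
```

In this sketch `decoded` reproduces the gold segmentation, and `f1` scores a prediction that wrongly merges two gold words into one. In a full system the `question` and `context` strings would be encoded together by BERT, and the two label vectors would serve as targets for the binary classifiers.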

Keywords: Chinese word segmentation; sequence labeling; ambiguous words; machine reading comprehension

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

