Authors: ZHOU Yulin (周裕林), CHEN Yanping (陈艳平), HUANG Ruizhang (黄瑞章), QIN Yongbin (秦永彬) [1,2], LIN Chuan (林川)
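Affiliations: [1] State Key Laboratory of Public Big Data, Guiyang 550025, China; [2] College of Computer Science & Technology, Guizhou University, Guiyang 550025, China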
Source: Journal of Xi'an Jiaotong University (《西安交通大学学报》), 2022, No. 8, pp. 95-103 (9 pages)
Funding: National Natural Science Foundation of China (62166007).
Abstract: Conventional sequence labeling models for Chinese word segmentation struggle to capture the long-distance semantic dependencies of a sentence, so they do not make full use of the input features, and boundary samples are scarce, which leads to data imbalance. To address these problems, this paper proposes a Chinese word segmentation method based on a machine reading comprehension (MRC) model. First, the sequence labeling task is converted into a machine reading comprehension task: a triple of question information, text content and word answers is constructed so that the input features of the sentence are used effectively. Then, the triple is encoded with the pre-trained bidirectional encoder representations from transformers (BERT) model to capture contextual information, and a binary classifier is used to predict the word answers. Finally, the original cross-entropy loss function is modified to alleviate the data imbalance. Experiments on four public datasets of the Bakeoff 2005 corpus (PKU, MSRA, CITYU and AS) show that the proposed method achieves F1 values of 96.64%, 97.8%, 97.02% and 96.02%, respectively. Compared with other mainstream neural network sequence labeling models, these F1 values are higher by 0.13%, 0.37%, 0.4% and 0.08%, respectively.
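The abstract does not give the question wording, the label scheme or the decoding rule, so the following is only a minimal sketch of how a gold-segmented sentence might be turned into a (question, text, answers) triple with binary targets. It assumes one start label and one end label per character, which is a common MRC-style span formulation and not necessarily the authors' design. The names `build_mrc_example` and `decode_spans` and the default question string are hypothetical.

```python
from typing import Dict, List


def build_mrc_example(
    words: List[str],
    question: str = "Find all the words in the text.",  # hypothetical wording
) -> Dict[str, object]:
    """Build a (question, context, answers) triple from a segmented sentence.

    Each character gets two binary labels (an assumed scheme):
    start_labels[i] == 1 if a word begins at character i,
    end_labels[i] == 1 if a word ends at character i.
    """
    words = [w for w in words if w]  # ignore empty tokens
    context = "".join(words)
    start_labels = [0] * len(context)
    end_labels = [0] * len(context)
    pos = 0
    for w in words:
        start_labels[pos] = 1
        end_labels[pos + len(w) - 1] = 1
        pos += len(w)
    return {
        "question": question,
        "context": context,
        "answers": words,
        "start_labels": start_labels,
        "end_labels": end_labels,
    }


def decode_spans(context: str, start_labels: List[int], end_labels: List[int]) -> List[str]:
    """Recover words by pairing each start with the next end at or after it."""
    words: List[str] = []
    start = None
    for i in range(len(context)):
        if start_labels[i] == 1:
            start = i
        if end_labels[i] == 1 and start is not None:
            words.append(context[start : i + 1])
            start = None
    return words


example = build_mrc_example(["我", "爱", "北京", "天安门"])
recovered = decode_spans(
    example["context"], example["start_labels"], example["end_labels"]
)
```

In a full model, the two label vectors would be the targets of the binary classifier placed on top of the BERT encoding of the question and the text. The loss modification mentioned in the abstract is not specified there and is not sketched here.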
Classification No.: TP391 [Automation and Computer Technology: Computer Application Technology]