Authors: 左壮壮, 王法玉, 陈洪涛 (ZUO Zhuangzhuang; WANG Fayu; CHEN Hongtao)
Affiliation: [1] School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China
Source: Journal of Tianjin University of Technology (《天津理工大学学报》), 2025, No. 1, pp. 83-89 (7 pages)
Funding: National Key Research and Development Program of China (2021YFC3300402); Tianjin University of Technology Teaching Research and Reform Project (YB22-12).
Abstract: For the task of Chinese text error detection and correction, this paper proposes a model that combines enhanced representation through knowledge integration (ERNIE) with sequence labeling. The model consists of an error detection part and an error correction part. In the detection stage, ERNIE encodes word vectors using a global attention mechanism and feeds them into a BiLSTM-CRF sequence labeling model. The bi-directional long short-term memory (BiLSTM) network extracts contextual information in both directions and concatenates it into bidirectional word vectors. A conditional random field (CRF) then computes the joint probability over the label sequence, which strengthens the dependence on neighboring labels and optimizes the sequence as a whole, reducing the mislabeling caused by problems such as label bias. In the correction stage, errors are corrected with a different strategy for each type reported by the detection model: errors labeled as wrong characters or missing characters are predicted with the ERNIE masked language model and confusion set matching, while redundant-character and out-of-order errors are corrected directly. Experimental results show that introducing sequence labeling and correcting by error type effectively improves the correction rate, with F1 reaching 81.8% on the SIGHAN dataset.
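Keywords: Chinese text error detection and correction; enhanced representation through knowledge integration (ERNIE); sequence labeling; bi-directional long short-term memory (BiLSTM); conditional random field (CRF); multi-strategy error correction
Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]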
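Illustrative sketch (not from the paper): the correction stage summarized in the abstract dispatches on the error type assigned by the detector. The Python below is a minimal illustration of that dispatch only. The tag names, the `correct` function, the `Predictor` callback that stands in for the ERNIE masked language model, the adjacent-swap rule for out-of-order characters, and the convention that a missing character is tagged on the character before the gap are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical tag set; the abstract does not name the labels it uses.
OK, WRONG, MISSING, REDUNDANT, DISORDER = "O", "S", "M", "R", "W"

# predict(left_context, right_context, candidates) -> chosen character.
# Stands in for a masked language model; candidates is empty when no
# confusion set restricts the choice.
Predictor = Callable[[str, str, Sequence[str]], str]


def correct(chars: List[str], tags: List[str],
            confusion: Dict[str, List[str]], predict: Predictor) -> str:
    out: List[str] = []
    i = 0
    while i < len(chars):
        ch, tag = chars[i], tags[i]
        right = "".join(chars[i + 1:])
        if tag == REDUNDANT:
            # Redundant character: corrected directly by dropping it.
            pass
        elif tag == DISORDER and i + 1 < len(chars) and tags[i + 1] == DISORDER:
            # Out-of-order pair: corrected directly by swapping (assumed rule).
            out.extend([chars[i + 1], ch])
            i += 1  # the partner character has been consumed as well
        elif tag == WRONG:
            # Wrong character: mask it and choose from its confusion set.
            out.append(predict("".join(out), right, confusion.get(ch, [])))
        elif tag == MISSING:
            # Missing character: keep ch, then predict the character after it.
            out.append(ch)
            out.append(predict("".join(out), right, []))
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def toy_predict(left: str, right: str, candidates: Sequence[str]) -> str:
    # Toy stand-in for the language model: first candidate, else a placeholder.
    return candidates[0] if candidates else "?"


print(correct(list("我门喜欢欢习学"),
              ["O", "S", "O", "O", "R", "W", "W"],
              {"门": ["们"]}, toy_predict))
# → 我们喜欢学习
```

In a real system `toy_predict` would be replaced by a call that scores the masked position with a pretrained model; keeping it as a callback keeps the per-type dispatch separate from whichever model does the prediction.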