Chinese text error detection and correction combining ERNIE and sequence labeling

Authors: ZUO Zhuangzhuang (左壮壮); WANG Fayu (王法玉); CHEN Hongtao (陈洪涛)

Affiliation: [1] School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China

Source: Journal of Tianjin University of Technology (《天津理工大学学报》), 2025, No. 1, pp. 83-89 (7 pages)

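Funding: National Key Research and Development Program of China (2021YFC3300402); Tianjin University of Technology Teaching Research and Reform Project (YB22-12).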

Abstract: For the task of Chinese text error detection and correction, this paper proposes a model that combines enhanced representation through knowledge integration (ERNIE) with sequence labeling. The model consists of two parts: error detection and error correction. In the detection stage, ERNIE uses a global attention mechanism to encode the input into word vectors, which are fed into a BiLSTM-CRF sequence labeling model. The bi-directional long short-term memory (BiLSTM) network extracts forward and backward contextual information and concatenates it to produce bidirectional word vectors. A conditional random field (CRF) then computes the joint probability of the label sequence, which strengthens the dependence between neighboring labels and optimizes the sequence as a whole, thereby mitigating problems such as label bias before the error labels are output. In the correction stage, different strategies are applied according to the error type reported by the detection model: errors labeled as wrong characters or missing characters are predicted with the ERNIE masked language model together with confusion-set matching, while redundant-character and word-order errors are corrected directly. Experimental results show that introducing sequence labeling and correcting errors by type effectively improves the correction rate, with an F1 score of 81.8% on the SIGHAN dataset.
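As a rough illustration of the type-dependent correction stage described in the abstract, the Python sketch below dispatches on a per-character error tag. It is not the authors' implementation. The tag names, the `correct` function, the `predict` callback (a stand-in for the ERNIE masked language model), the convention that a missing-character tag marks the character before which something is missing, and the dictionary form of the confusion set are all assumptions made for illustration; the abstract does not specify any of them.

```python
from typing import Callable, Dict, List, Set

# Hypothetical tag set: the abstract names four error types but not their labels.
OK, WRONG, MISSING, REDUNDANT, DISORDER = "O", "W", "M", "R", "D"
MASK = "[MASK]"

# Stand-in for a masked language model: (tokens, index of mask) -> ranked candidates.
Predictor = Callable[[List[str], int], List[str]]


def correct(chars: List[str], tags: List[str], predict: Predictor,
            confusion: Dict[str, Set[str]]) -> str:
    """Apply a different correction strategy for each detected error type."""
    out: List[str] = []
    i = 0
    while i < len(chars):
        ch, tag = chars[i], tags[i]
        if tag == REDUNDANT:
            # Redundant character: corrected directly by dropping it.
            i += 1
            continue
        if tag == DISORDER and i + 1 < len(chars) and tags[i + 1] == DISORDER:
            # Word-order error: corrected directly by swapping the tagged pair.
            out.extend([chars[i + 1], ch])
            i += 2
            continue
        if tag == WRONG:
            # Wrong character: mask it, then prefer a candidate from the confusion set.
            candidates = predict(out + [MASK] + chars[i + 1:], len(out))
            similar = confusion.get(ch, set())
            pick = next((c for c in candidates if c in similar), None)
            if pick is None:
                pick = candidates[0] if candidates else ch
            out.append(pick)
        elif tag == MISSING:
            # Missing character (assumed to precede `ch`): insert the top prediction.
            candidates = predict(out + [MASK] + chars[i:], len(out))
            if candidates:
                out.append(candidates[0])
            out.append(ch)
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def toy_predict(tokens: List[str], index: int) -> List[str]:
    """Toy predictor used only for this demo; a real system would query a model."""
    return ["兴"]


print(correct(list("我今天很高心"), ["O", "O", "O", "O", "O", "W"],
              toy_predict, {"心": {"兴", "新"}}))  # prints 我今天很高兴
```

The confusion set acts as a filter on the model's ranked candidates, so a phonetically or visually similar replacement is preferred over a merely fluent one; when no candidate matches, the sketch falls back to the top prediction.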

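Keywords: Chinese text error detection and correction; enhanced representation through knowledge integration (ERNIE); sequence labeling; bi-directional long short-term memory (BiLSTM); conditional random field (CRF); multi-strategy error correction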

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]
