Grammatical Error Correction by Transferring Learning Based on Pre-Trained Language Model

Cited by: 2

Authors: HAN Mingyue; WANG Yinglin[1] (School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai 200433, China)

Affiliation: [1] School of Information Management and Engineering, Shanghai University of Finance and Economics, Shanghai 200433, China

Source: Journal of Shanghai Jiaotong University, 2022, No. 11, pp. 1554-1560 (7 pages)

Abstract: Grammatical error correction (GEC) is a low-resource task in natural language processing: training a GEC model requires costly annotation and substantial training time. To address this, the MASS-GEC system is proposed, which transfers learning from MASS (masked sequence-to-sequence pre-training for language generation), reusing the linguistic features the pre-trained model has already extracted and fine-tuning it on annotated GEC data. Task-specific preprocessing and postprocessing strategies further improve the system's performance. Evaluated on two public GEC tasks, the system achieves competitive results under limited resources compared with state-of-the-art GEC systems. Specifically, on the CoNLL-2014 dataset it scores 57.9 on the precision-emphasizing F measure; on the JFLEG dataset it scores 59.1 on GLEU, which measures the n-gram overlap between the system's corrected output and manually annotated reference corrections. This work offers a new perspective on the low-resource problem in GEC: general language knowledge learned by a self-supervised pre-trained language model can be exploited, via the text features suited to GEC, to assist in solving the task.
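The two evaluation metrics named in the abstract can be sketched as follows. This is an illustrative Python sketch, not the official CoNLL-2014 M2 scorer or the JFLEG GLEU implementation: the function names are assumptions, and the n-gram overlap is simplified to a single reference and a single n-gram order, without GLEU's source-sentence term.

```python
# Illustrative sketch of the two metrics reported in the paper (assumed
# helper names; not the official scorers).
from collections import Counter


def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Precision-weighted F measure over edit counts.

    beta = 0.5 weights precision more heavily than recall, matching the
    precision-emphasizing F measure used on the CoNLL-2014 task.
    """
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


def ngram_overlap(hyp: list[str], ref: list[str], n: int = 2) -> float:
    """Clipped n-gram precision of a hypothesis against one reference,
    in the spirit of GLEU's n-gram coincidence (simplified)."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped matches
    total = sum(hyp_ngrams.values())
    return overlap / total if total else 0.0
```

For example, a system making 30 correct edits, 10 spurious edits, and missing 40 gold edits gets `f_beta(30, 10, 40) ≈ 0.652`: precision (0.75) dominates the score even though recall is only about 0.43.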

Keywords: grammatical error correction; natural language generation; sequence to sequence

CLC number: TP391.1 (Automation and Computer Technology: Computer Application Technology)

 
