Chinese Spelling Correction Model Based on Gated Feature Fusion

Authors: ZHOU Yuhao; SUN Zhe[1]; WU Xiaofei[1]; YU Ke[1] (School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China)

Affiliation: [1] School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876

Source: Journal of Beijing University of Posts and Telecommunications (《北京邮电大学学报》), 2023, Issue 4, pp. 91-96, 122 (7 pages)

Funding: National Natural Science Foundation of China (61601046).

Abstract: In Chinese spelling correction, methods that fuse the semantic, phonetic and glyph information of Chinese characters with equal weight can suffer degraded performance when the phonetic or glyph information is misleading. To address this problem, a Chinese spelling correction model based on gated feature fusion is proposed. It uses adaptive gates to selectively fuse semantic, phonetic and glyph information, which improves model performance and enhances interpretability. In addition, an improved four-corner code is used to encode the glyph information of Chinese characters, which effectively extracts their glyph features, and the glyph-similarity confusion set used during pre-training is expanded on this basis. A pre-training masking strategy based on confusion-set replacement enables the model to effectively learn knowledge of textual errors. On the public SIGHAN13, SIGHAN14 and SIGHAN15 datasets, the proposed model achieves correction F1 scores of 78.7%, 67.8% and 77.7%, respectively, which are 1.5%, 1.5% and 1.0% higher than the best baseline model.
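
Illustrative sketch (not from the paper): the abstract does not give the gating equations, so the following is only one plausible way that adaptive gates could selectively fuse the three feature types. It assumes a scalar sigmoid gate per auxiliary feature, computed from the concatenation of the semantic vector with that feature, and a fused vector of the form semantic + g_p * phonetic + g_g * glyph. The names `gate`, `gated_fusion`, `w_p`, `b_p`, `w_g` and `b_g` are hypothetical, and the weights are hand-set rather than learned.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def gate(weights, bias, semantic, feature):
    """Scalar gate in (0, 1) from the concatenation [semantic; feature].

    Assumption: the paper's actual gate may be vector-valued or use
    different inputs; this form is chosen only for illustration.
    """
    return sigmoid(dot(weights, list(semantic) + list(feature)) + bias)


def gated_fusion(semantic, phonetic, glyph, w_p, b_p, w_g, b_g):
    """Fuse three feature vectors of one character.

    Returns the fused vector plus the two gate values, so the gates can be
    inspected to see how much each auxiliary feature contributed.
    """
    g_p = gate(w_p, b_p, semantic, phonetic)  # weight of phonetic feature
    g_g = gate(w_g, b_g, semantic, glyph)     # weight of glyph feature
    fused = [s + g_p * p + g_g * g
             for s, p, g in zip(semantic, phonetic, glyph)]
    return fused, g_p, g_g


if __name__ == "__main__":
    semantic = [0.2, -0.1]
    phonetic = [1.0, 1.0]
    glyph = [0.5, -0.5]
    zeros = [0.0, 0.0, 0.0, 0.0]
    # A strongly negative bias closes the glyph gate, mimicking a case
    # where the glyph signal is judged unreliable.
    fused, g_p, g_g = gated_fusion(semantic, phonetic, glyph,
                                   zeros, 0.0, zeros, -20.0)
    print(fused, g_p, g_g)
```

With the glyph gate nearly closed, the fused vector is driven by the semantic and phonetic features only. Reading off gate values in this way is one route by which a gated design can be more interpretable than equal-weight fusion.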

Keywords: Chinese spelling correction; pre-training; gated feature fusion; four-corner code

Classification code: TP183 [Automation and Computer Technology: Control Theory and Control Engineering]

 
