Research on Chinese spelling correction based on the integration of context and text structure (基于语境与文本结构融合的中文拼写纠错方法)

Authors: Liu Changchun; Zhang Kai; Bao Meikai; Liu Ye; Liu Qi [1,2] (School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China; School of Data Science, University of Science and Technology of China, Hefei 230027, China)

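Affiliations: [1] School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027 [2] School of Data Science, University of Science and Technology of China, Hefei 230027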

Source: Journal of Nanjing University (Natural Science), 2024, Issue 3, pp. 451-463 (13 pages)

Funding: National Key Research and Development Program of China (2021YFF0901003).

Abstract: Chinese Spelling Correction (CSC) methods often understand sentence semantics insufficiently and make little use of the phonetic and visual information of Chinese characters. To address these issues, we propose ECS, an error correction method based on context confidence and Chinese character similarity. Built on deep learning, the method combines the visual similarity of Chinese characters, their phonetic similarity, and a fine-tuned pre-trained BERT model, so that it extracts sentence semantics automatically and exploits the similarity between characters. Specifically, we fine-tune a pre-trained Chinese BERT model to adapt it to the downstream CSC task. We then use the ideographic description sequence of each character to obtain its tree structure as visual information, and its pinyin sequence as phonetic information. Finally, we compute the visual and phonetic similarity of characters with edit (Levenshtein) distance and fuse these similarities with the fine-tuned BERT model to perform correction. Experiments on the SIGHAN benchmark datasets show that, compared with the baseline model, ECS improves F1-score by 2.1% at the error detection level and by 2.8% at the error correction level, which supports fusing context, visual, and phonetic information for Chinese spelling correction.

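Keywords: Chinese spelling correction; BERT; phonetic similarity of Chinese characters; visual similarity of Chinese characters; pre-trained model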

Classification code: TP391.1 [Automation and Computer Technology - Computer Application Technology]
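
The similarity step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that similarity is the edit distance normalized by the longer sequence, that the ideographic description sequence is compared as a flat string rather than as a tree, and that the context confidence is combined with the similarities by a simple weighted sum. The function names, the `weight` parameter, and the fusion formula are all hypothetical. The abstract does not specify any of them.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def fused_score(confidence, phonetic_sim, visual_sim, weight=0.5):
    """Hypothetical fusion: weighted sum of the model's context confidence
    and the larger of the two character similarities."""
    return weight * confidence + (1.0 - weight) * max(phonetic_sim, visual_sim)


# Hand-written inputs for the pair 晴 / 睛 (pinyin with tone digit, flat IDS string).
phonetic = similarity("qing2", "jing1")   # two substitutions over five symbols
visual = similarity("⿰日青", "⿰目青")      # one substitution over three symbols
score = fused_score(0.8, phonetic, visual)  # 0.8 is a made-up context confidence
```

In this sketch a candidate character is favored when the language model is confident in it and it also looks or sounds like the character originally written. Comparing tree structures instead of flat strings would need a tree edit distance instead.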
