Chinese spelling correction based on fusion of text sequence error probability and Chinese spelling error probability

Original title: 基于文本序列错误概率和中文拼写错误概率融合的汉语纠错算法


Authors: Sun Zhe (孙哲)[1], Yu Ke (禹可)[1], Wu Xiaofei (吴晓非)[1] (School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China)

Affiliation: [1] School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China

Source: Application Research of Computers (《计算机应用研究》), 2023, No. 8, pp. 2292-2297 (6 pages)

Abstract: Chinese spelling correction is the task of detecting and correcting spelling errors in text. Most Chinese spelling errors involve the misuse of characters that are semantically, phonetically, or graphically similar, so a common approach is to extract and model features from these different modalities. However, fusing the features directly, or summing them with fixed weights, ignores the relative importance of the information from each modality and biases the model when it identifies errors, which prevents it from learning effectively. To address this problem, this paper proposes a new model, a Chinese error correction algorithm based on the fusion of text sequence error probability and Chinese spelling error probability. The method uses the text sequence error probability as a dynamic weight and the probability of common Chinese spelling errors as a fixed weight to fuse semantic, phonetic, and glyph information efficiently. The model can reasonably control how much information from each modality flows into the mixed-modality representation, and it focuses its learning on the positions where errors occur. Experiments on the SIGHAN benchmark show that the proposed model improves all evaluation scores across the different datasets, which validates the feasibility of the algorithm.
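The abstract does not give the fusion formula, so the following is only a minimal sketch of one way a dynamic weight and a fixed weight could combine the three modalities. Everything in it is an assumption for illustration: the function names `fuse_position` and `fuse_sequence`, the gating form (the per-position error probability interpolates between the semantic vector and a fixed-weight mix of the phonetic and glyph vectors), and the two placeholder weight values, which are not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]

# Hypothetical fixed weights standing in for the "common spelling error
# probability". These are arbitrary placeholders, NOT values from the paper.
FIXED_PHONETIC_WEIGHT = 0.6
FIXED_GLYPH_WEIGHT = 0.4


def fuse_position(
    semantic: Sequence[float],
    phonetic: Sequence[float],
    glyph: Sequence[float],
    error_prob: float,
) -> Vector:
    """Fuse three feature vectors for one character position.

    Assumed gating form (not confirmed by the source):
        fused = (1 - p) * semantic + p * (w_ph * phonetic + w_gl * glyph)
    where p is the dynamic, per-position error probability.
    """
    if not 0.0 <= error_prob <= 1.0:
        raise ValueError("error_prob must lie in [0, 1]")
    if not (len(semantic) == len(phonetic) == len(glyph)):
        raise ValueError("feature vectors must have the same length")

    fused: Vector = []
    for s, p, g in zip(semantic, phonetic, glyph):
        # Fixed-weight mix of the two similarity modalities.
        similarity_mix = FIXED_PHONETIC_WEIGHT * p + FIXED_GLYPH_WEIGHT * g
        # Dynamic weight: the more likely the position is wrong, the more
        # the phonetic/glyph information is allowed to flow in.
        fused.append((1.0 - error_prob) * s + error_prob * similarity_mix)
    return fused


def fuse_sequence(
    semantic_seq: Sequence[Sequence[float]],
    phonetic_seq: Sequence[Sequence[float]],
    glyph_seq: Sequence[Sequence[float]],
    error_probs: Sequence[float],
) -> List[Vector]:
    """Apply fuse_position independently at every position of a sentence."""
    if not (
        len(semantic_seq) == len(phonetic_seq) == len(glyph_seq) == len(error_probs)
    ):
        raise ValueError("all inputs must cover the same number of positions")
    return [
        fuse_position(s, p, g, e)
        for s, p, g, e in zip(semantic_seq, phonetic_seq, glyph_seq, error_probs)
    ]
```

Under this assumed form, a position judged correct (error probability 0) keeps its semantic vector unchanged, while a position judged certainly wrong (error probability 1) is represented purely by the fixed-weight phonetic and glyph mix. In a trained model the error probabilities would come from a detection component and the features from learned encoders; both are outside the scope of this sketch.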

Keywords: Chinese spelling correction; error probability; pre-training; information fusion; sequence-to-sequence model

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]

 
