Handwritten Chinese character error correction combining radical glyph shape and hierarchical structure  Cited by: 2

A method of radical form and hierarchical structure based handwritten Chinese character error correction


Authors: Li Yunqing; Du Jun[1]; Hu Pengfei[1]; Zhang Jianshu (National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230026, China; iFLYTEK CO., LTD., Hefei 230088, China)

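Affiliations: [1] National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei 230026 [2] iFLYTEK CO., LTD., Hefei 230088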

Source: Journal of Image and Graphics (《中国图象图形学报》), 2023, No. 8, pp. 2382-2395 (14 pages)

Abstract: Objective The handwritten Chinese character error correction (HCCEC) task is twofold: judging whether a character is correct and correcting misspelled characters. It is widely applicable in education, where it can help students learn Chinese characters and correct their writing errors. Handwritten Chinese characters have complex spatial structures, diverse writing styles, and a huge vocabulary, and misspelled characters closely resemble right ones, so the key to HCCEC is modeling a character precisely. To this end, a hierarchical radical network (HRN) is proposed. Method Starting from radical glyph shape, HRN mines the similarity between radicals in shape and structure and uses an attention module to capture fine-grained image features that contain radical information, which makes similar characters easier to tell apart. In addition, exploiting the hierarchical structure of Chinese characters, it models the hierarchical position of each radical with a probabilistic decoding approach. Results In experiments on a handwritten Chinese character dataset, compared with existing approaches, HRN improves precision by 0.5% on the right-character test set and by 9.8% on the misspelled-character test set, and improves the correction rate on the misspelled-character test set by 15.3%. Visualization of the attention mechanism verifies that HRN captures fine-grained image features containing radical information. The Euclidean distances between radical representations show that the radical representation vectors learned by HRN encode the glyph structure of the radicals. Conclusion The proposed HRN better distinguishes similar radicals and therefore separates right characters from misspelled ones precisely, with strong robustness and generalization.

Objective Handwritten Chinese character error correction (HCCEC) must cope with the complex hierarchical structure, varied writing styles, and large character vocabulary of Chinese. The task has two parts: assessment and correction. Assessment determines whether a given isolated handwritten character is correct. Correction locates and corrects the specific errors in a misspelled character. HCCEC differs from handwritten Chinese character recognition (HCCR) in three respects. First, the categories of misspelled characters are open-ended, which places a high demand on the generalization ability of the model. We assume that the training samples are all right characters, while the test set contains both right and misspelled characters. The model's transfer ability is therefore challenged by misspelled characters it has never seen, and HCCEC can be cast as a generalized zero-shot learning (GZSL) problem. Compared with zero-shot learning, a GZSL test set contains both seen and unseen classes, which makes the setting more realistic and more challenging. At the same time, misspelled characters tend to be misclassified as right ones at test time. Second, misspelled characters can be very similar to the right ones, so the model must be able to capture fine-grained features. Third, beyond HCCR, HCCEC must also link each misspelled character to its corresponding right character. Method We exploit the similarity between radicals in shape and structure and propose a hierarchical radical network (HRN). The key to analyzing Chinese characters is extracting radical and structural information. Similar radicals should lie close together in the representation space. Complete radical information helps distinguish similar characters, which is crucial for resolvin
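
As a rough illustration of two ideas from the abstract (radical representations whose distances reflect glyph similarity, and radicals tied to hierarchical positions), the following Python sketch may help. It is not the HRN model: the embedding values are made up, the three-character dictionary is a toy, and the names `RADICAL_VECS`, `DICTIONARY`, `euclidean`, and `assess` are hypothetical. A plain set comparison stands in for the learned probabilistic decoding described in the paper.

```python
import math

# Hypothetical radical embeddings; the numbers are invented for illustration only.
RADICAL_VECS = {
    "日": [1.0, 0.0, 0.2],
    "目": [0.9, 0.1, 0.3],   # visually close to 日, so placed nearby
    "氵": [-0.8, 0.7, 0.0],  # visually unrelated, so placed far away
}

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy dictionary of right characters: each is a set of
# (hierarchical position, radical) pairs. "L"/"R" = left/right part.
DICTIONARY = {
    "明": {("L", "日"), ("R", "月")},
    "汩": {("L", "氵"), ("R", "日")},
    "泪": {("L", "氵"), ("R", "目")},
}

def assess(pred):
    """Return (is_right, nearest_char, wrong_pairs, expected_pairs).

    pred is the set of (position, radical) pairs recognized from an image.
    The nearest right character is the one whose pair set differs least.
    """
    nearest = min(DICTIONARY, key=lambda ch: len(DICTIONARY[ch] ^ pred))
    target = DICTIONARY[nearest]
    return pred == target, nearest, pred - target, target - pred

# Assessment and correction of a character written with 目 instead of 日.
is_right, nearest, wrong, expected = assess({("L", "目"), ("R", "月")})
print(is_right, nearest, wrong, expected)
```

In this toy setting the written character matches no dictionary entry, so it is judged misspelled, and the differing pair points at the radical and position to fix. A learned model would replace both the hand-written embeddings and the exact set match with predicted probabilities.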

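Keywords: handwritten Chinese character error correction (HCCEC); Chinese character recognition; radical analysis; generalized zero-shot learning (GZSL); attention mechanism; convolutional neural network (CNN)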

Classification code: TP391.4 [Automation and Computer Technology: Computer Application Technology]

 
