Character-aware edit distance for zero-shot Chinese character recognition


Authors: Chen Yu (陈宇); Wang Dahan (王大寒); Chi Xueke (池雪可); Jiang Nanfeng (江楠峰); Zhang Xuyao (张煦尧); Wang Chiming (王驰明); Zhu Shunzhi (朱顺痣)

Affiliations: [1] Fujian Key Laboratory of Pattern Recognition and Image Understanding, School of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China; [2] State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

Source: Journal of Image and Graphics (《中国图象图形学报》), 2024, No. 11, pp. 3383-3400 (18 pages)

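Funding: National Natural Science Foundation of China (61773325, 62222609, 62076236); Open Project of the State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS2024101); Natural Science Foundation of Xiamen (3502Z202373058); Fujian Provincial Key Technological Innovation and Industrialization Project (2023XQ023); Fuzhou-Xiamen-Quanzhou National Independent Innovation Demonstration Zone Project (2022FX4); High-Tech Ship Special Sub-topic of the Ministry of Industry and Information Technology (CBG4N21-4-4); Fujian Provincial Education and Research Project for Young and Middle-aged Teachers, Key Project (JZ230050).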

Abstract: Objective: Zero-shot Chinese character recognition (ZSCCR) has attracted wide attention because it can recognize unseen Chinese characters with zero or few training samples. Most existing ZSCCR methods adopt a radical-sequence matching framework: they first predict the radical sequence and then perform minimum edit distance (MED) matching against an ideographic description sequence (IDS) dictionary. However, existing MED algorithms assume that the substitution, insertion, and deletion costs are the same for all radicals, which makes the distance costs of candidate character classes ambiguous and redundant during matching. To address this, a character-aware edit distance (CAED) is proposed to correctly match the target character class. Method: Several radical information extraction methods are designed to obtain finer-grained radical descriptions and hence more precise radical substitution costs, which improves the robustness and effectiveness of MED. In addition, a radical counting module is proposed to predict the number of radicals in a sample; this forms a cost gate that constrains and adjusts the insertion and deletion costs and overcomes the effect of inaccurately predicted IDS sequence lengths. Results: In experiments on handwritten, scene, and ancient-book Chinese character datasets, the proposed CAED improves accuracy on unseen character classes by 4.64%, 1.1%, and 5.08%, respectively, over previous methods, while maintaining comparable performance on seen character classes. Conclusion: The proposed character-aware edit distance adaptively adjusts the three edit costs (substitution, insertion, and deletion) according to the character, which effectively improves recognition of unseen Chinese characters.

Objective: Zero-shot Chinese character recognition (ZSCCR) has attracted increasing attention in recent years because it can recognize unseen Chinese characters with zero or few training samples. The fundamental concept of zero-shot learning is to solve the new-class recognition problem by generalizing semantic knowledge from seen classes to unseen classes, usually represented by auxiliary information such as attribute descriptions shared between classes. Chinese characters comprise multiple radicals; therefore, radicals are often used as shared attributes between Chinese character classes. Most existing ZSCCR methods adopt a radical-based sequence matching framework that recognizes a character by predicting its radical sequence, followed by minimum edit distance (MED) matching against the ideographic description sequence (IDS) dictionary. MED compares the predicted radical sequence with each entry of the IDS dictionary, measures the difference between the two sequences, and thereby determines the unseen Chinese character category. However, this algorithm sets the insertion, deletion, and substitution costs all to 1, assuming that the cost is the same between all pairs of radicals. In practice, the substitution cost between similar radicals should be lower than that between dissimilar radicals. Moreover, this approach lacks flexibility when the predicted IDS sequence is too long or too short, which results in redundant insertion or deletion costs. Consequently, a character-aware edit distance (CAED) is proposed that extracts refined radical substitution costs and accounts for the impact of insertion and deletion costs. Method: The CAED in this study adaptively adjusts the costs of substitution, insertion, and deletion in the edit distance to match the unseen Chinese character category according to the sensitivity of each Chinese character. In ZSCCR, the key to the radical-based approach lies in identifying radical sequences and the metri
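The matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`weighted_edit_distance`, `make_sub_cost`, `match_character`), the two-entry IDS dictionary, and the similarity value 0.3 for the pair 日/目 are all assumptions chosen for illustration. The sketch shows only how a non-uniform substitution cost can break a tie that the uniform-cost MED leaves ambiguous. The cost gate is represented merely by exposing the insertion and deletion costs as parameters, since the abstract does not state how the radical counting module sets them.

```python
from typing import Callable, Dict, FrozenSet, Sequence


def uniform_sub_cost(a: str, b: str) -> float:
    """Baseline MED substitution cost: 0 for identical symbols, 1 otherwise."""
    return 0.0 if a == b else 1.0


def make_sub_cost(similar: Dict[FrozenSet[str], float]) -> Callable[[str, str], float]:
    """Build a substitution cost from a (hypothetical) radical-similarity table.

    Pairs missing from the table fall back to the uniform cost of 1.
    """
    def cost(a: str, b: str) -> float:
        if a == b:
            return 0.0
        return similar.get(frozenset((a, b)), 1.0)
    return cost


def weighted_edit_distance(
    pred: Sequence[str],
    target: Sequence[str],
    sub_cost: Callable[[str, str], float] = uniform_sub_cost,
    ins_cost: float = 1.0,
    del_cost: float = 1.0,
) -> float:
    """Edit distance from pred to target using dynamic programming."""
    m, n = len(pred), len(target)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # delete every predicted symbol
        dp[i][0] = dp[i - 1][0] + del_cost
    for j in range(1, n + 1):          # insert every target symbol
        dp[0][j] = dp[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + del_cost,
                dp[i][j - 1] + ins_cost,
                dp[i - 1][j - 1] + sub_cost(pred[i - 1], target[j - 1]),
            )
    return dp[m][n]


def match_character(pred: Sequence[str], ids_dict: Dict[str, Sequence[str]], **costs) -> str:
    """Return the dictionary character whose IDS is closest to the prediction."""
    return min(ids_dict, key=lambda ch: weighted_edit_distance(pred, ids_dict[ch], **costs))


# Toy IDS dictionary (illustrative subset only).
ids_dict = {"明": ["⿰", "日", "月"], "朋": ["⿰", "月", "月"]}
# Suppose the recognizer misreads 日 as 目.
pred = ["⿰", "目", "月"]

# Uniform costs: both candidates are one substitution away, so the match is ambiguous.
uniform = {ch: weighted_edit_distance(pred, seq) for ch, seq in ids_dict.items()}
print(uniform)

# Assumed similarity table: 日 and 目 look alike, so swapping them is cheap.
cost = make_sub_cost({frozenset(("日", "目")): 0.3})
print(match_character(pred, ids_dict, sub_cost=cost))
```

With uniform costs the two candidates are equally distant from the prediction, so the choice depends only on dictionary order. With the assumed similarity table, 明 becomes the unique nearest entry. Lowering `ins_cost` or `del_cost` would similarly reduce the penalty for a predicted sequence that is too short or too long.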

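Keywords: zero-shot Chinese character recognition (ZSCCR); ideographic description sequence (IDS); edit distance; character-aware; radical information; cost gate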

Classification number: TP391.4 [Automation and Computer Technology: Computer Application Technology]

 
