IHCCD: dataset for identification of irregular handwritten Chinese characters


Authors: Ji Jiamei; Shao Yunxue; Ji Tanzheng (School of Computer and Information Engineering (School of Artificial Intelligence), Nanjing Tech University, Nanjing 211816, China)

Affiliation: [1] School of Computer and Information Engineering (School of Artificial Intelligence), Nanjing Tech University, Nanjing 211816

Source: Journal of Image and Graphics (《中国图象图形学报》), 2024, No. 11, pp. 3345-3356 (12 pages)

Funding: 2023 Open Research Project of the Key Laboratory for Ferrous Metallurgy and Resources Utilization, Ministry of Education (FMRUlab23-02).

Abstract: Objective With the rapid development of deep learning, regular handwritten Chinese character recognition (HCCR) has achieved breakthrough progress, but research on recognizing irregularly written Chinese characters is still in its infancy. Because of calligraphic schools and personal writing habits, handwritten characters often differ markedly from printed fonts, so the overall structure of characters in the same class varies greatly, and recognition models trained on existing datasets cannot accurately recognize irregularly written characters. Method To promote research on irregular handwritten Chinese character recognition, this paper builds the first irregular handwritten Chinese character dataset (IHCCD), which currently contains 3755 classes with 30 samples per class. Baseline performance on this dataset is also reported for the classic deep learning models ResNet, CBAM-ResNet, Vision Transformer, and Swin Transformer. Result Experiments show that these classic networks perform well on the regularly written CASIA-HWDB1.1 dataset, where Swin Transformer reaches the highest accuracy of 95.31%. However, models trained on the CASIA-HWDB1.1 training set recognize the IHCCD test set poorly, with a highest accuracy of only 30.20%. After the IHCCD training set is added, the recognition performance of all the classic models on the IHCCD test set improves substantially, with the highest accuracy reaching 89.89%, which indicates that IHCCD is of research value for irregular handwritten Chinese character recognition. Conclusion Existing OCR models still have limitations, and the IHCCD dataset collected in this paper can effectively improve the generalization of recognition models. The dataset can be downloaded at https://pan.baidu.com/s/1PtcfWj3yUSz68o2ZzvPJOQ?pwd=66Y7.

Objective With the rapid development of deep learning technology, handwritten Chinese character recognition (HCCR) has made breakthrough progress. Text recognition research initially focused on English characters and digits, but as artificial intelligence has matured, many researchers have turned to Chinese character recognition. In recent years, Chinese character recognition has been widely applied in bank bill recognition, mail sorting, and office automation. Chinese characters are among the most widely used writing systems in the world, carry rich information, and are an important medium of communication, so research on Chinese character recognition is of considerable value. Despite these advances, the recognition of irregular handwritten Chinese characters remains challenging. Handwritten Chinese characters are often influenced by calligraphic styles and individual writing habits, leading to notable deviations from regular printed fonts. These variations can produce considerable differences in the overall structure of characters within the same category. Recognition models trained on existing regular datasets may therefore struggle to identify the irregularly handwritten Chinese characters encountered in real-world scenarios. For example, a picture sent on WeChat may contain text with sensitive words. If these words are written regularly, the text recognition engine can accurately identify and filter them. However, some people deliberately write irregularly to evade the text recognition engine and circumvent regulation, and the engine then fails to recognize these words. Therefore, the research on the recognition of irregular handwritten Chinese characters is of considera

Keywords: irregular writing; handwritten Chinese character recognition (HCCR); IHCCD dataset; deep learning; classic classification models

CLC number: TP391 [Automation and Computer Technology: Computer Application Technology]
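
The figures quoted in the abstract can be checked with a little arithmetic. The sketch below is illustrative only and is not the authors' evaluation code: it assumes every class holds exactly 30 samples and that the reported accuracy is top-1 accuracy, neither of which the abstract states explicitly. The function names and the toy labels are hypothetical.

```python
from typing import Sequence

# Figures taken from the abstract.
NUM_CLASSES = 3755
SAMPLES_PER_CLASS = 30
ACC_CASIA_ONLY = 30.20   # best IHCCD test accuracy, trained on CASIA-HWDB1.1 only
ACC_WITH_IHCCD = 89.89   # best IHCCD test accuracy after adding the IHCCD training set


def total_samples(num_classes: int = NUM_CLASSES,
                  per_class: int = SAMPLES_PER_CLASS) -> int:
    """Total image count, assuming every class has exactly `per_class` samples."""
    return num_classes * per_class


def top1_accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Percentage of samples whose predicted class index equals the true one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


def accuracy_gain(before: float = ACC_CASIA_ONLY,
                  after: float = ACC_WITH_IHCCD) -> float:
    """Improvement in percentage points, rounded to two decimals."""
    return round(after - before, 2)


if __name__ == "__main__":
    print(total_samples())                              # 112650
    # Hypothetical toy labels: 3 of 4 predictions are correct.
    print(top1_accuracy([0, 1, 2, 2], [0, 1, 2, 3]))    # 75.0
    print(accuracy_gain())                              # 59.69
```

Under these assumptions the dataset holds 112,650 images, and adding the IHCCD training set raises the best reported accuracy on the IHCCD test set by 59.69 percentage points.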

 
