Scene text recognition based on multimodal feature fusion

Authors: Cai Mingzhe; Wang Manli; Dou Zeya; Zhang Changsen (School of Physics & Electronic Information Engineering, Henan Polytechnic University, Jiaozuo, Henan 454003, China)

Affiliation: [1] School of Physics & Electronic Information Engineering, Henan Polytechnic University, Jiaozuo, Henan 454003, China

Source: Application Research of Computers, 2025, No. 4, pp. 1274-1280 (7 pages)

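Funding: National Natural Science Foundation of China (52074305); Henan Province Science and Technology Research Project (242102221006); Henan Province Postgraduate Education Reform and Quality Improvement Project (YJS2024AL026); Open Fund of the Henan Engineering Laboratory of Photoelectric Sensing and Intelligent Measurement and Control, Henan Polytechnic University (HELPSIMC-2020-00X).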

Abstract: To address the difficulty of recognizing natural scene text images affected by occlusion, distortion, and similar degradations, this paper proposes a scene text recognition network based on multimodal feature fusion (multimodal scene text recognition, MMSTR). First, MMSTR uses a permutation language model with shared weights and internal autoregression to support multiple decoding strategies. Second, in the image encoding stage, MMSTR introduces a residual attention encoder (REA-encoder) that improves the capture of shallow features and lets them propagate to deeper network layers, which effectively alleviates the feature collapse caused by the vision Transformer's insufficient extraction of shallow image features. Finally, to address the insufficient fusion of semantic and visual features during decoding, MMSTR builds a decision fusion module (DFM) that uses a cascaded multi-head attention mechanism to strengthen the fusion of semantic and visual features. Experiments show that MMSTR reaches an average word accuracy of 96.6% on six public datasets, including IIIT5K and ICDAR13. In addition, MMSTR shows a significant advantage over other mainstream algorithms when recognizing hard text images that are occluded or distorted.
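The abstract does not say how MMSTR builds its attention masks, so the following is only a minimal sketch of the general idea behind a permutation language model: one set of weights can serve several decoding strategies because each decoding order is expressed as a different attention mask. The function name `permutation_mask` and the three-character example are assumptions made for illustration, not taken from the paper.

```python
def permutation_mask(order):
    """Build an attention mask from a decoding order (illustrative sketch).

    `order` lists sequence positions in the order they are decoded.
    mask[i][j] is True when position i may attend to position j, that is,
    when j is decoded strictly earlier than i in `order`.
    """
    n = len(order)
    # rank[pos] = decoding step at which position `pos` is produced
    rank = {pos: step for step, pos in enumerate(order)}
    return [[rank[j] < rank[i] for j in range(n)] for i in range(n)]


# Two decoding strategies for a 3-character word, same model weights assumed.
left_to_right = permutation_mask([0, 1, 2])
right_to_left = permutation_mask([2, 1, 0])
```

With the left-to-right order each position sees only the positions before it, and with the right-to-left order only those after it; any other permutation gives a mixed context. No position ever attends to itself, which keeps the prediction autoregressive.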

Keywords: scene text; feature fusion; language model; attention mechanism; residual network

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
