Authors: 韩玉兰 (HAN Yulan); 罗轶宏 (LUO Yihong); 崔玉杰 (CUI Yujie); 兰朝凤 (LAN Chaofeng)
Affiliation: [1] College of Measurement and Control Technology and Communication Engineering, Harbin University of Science and Technology, Harbin 150080, Heilongjiang, China
Source: Optics and Precision Engineering, 2025, No. 1, pp. 135-147 (13 pages)
Funding: National Natural Science Foundation of China (No. 11804068); Natural Science Foundation of Heilongjiang Province (No. LH2020F033); Fundamental Research Funds for Heilongjiang Provincial Higher Education Institutions (No. 2020-KYYWF-0342).
Abstract: The accurate extraction of text content from images is hindered by the absence of scale transformation in feature representation and by insufficient resolution, which misleads the reconstruction network. To address this, this paper proposes a multimodal semantic interactive text image super-resolution reconstruction method. An attention mask within the semantic inference module corrects the text content information, and the resulting semantic prior constrains and guides the network toward reconstructing semantically accurate super-resolution text images. To enhance the network's representational capacity and accommodate text images of varying shapes and lengths, a multimodal semantic interaction block is introduced, built from three components: a visual dual-stream integration block, a cross-modal adaptive fusion block, and an orthogonal bidirectional gated recurrent unit. First, the visual dual-stream integration block captures multi-granularity visual information, including contextual understanding, by exploiting the complementary strengths of global statistical modelling and local fitting. Next, the cross-modal adaptive fusion block dynamically mediates the interaction between semantic information and multi-granularity visual features, reducing the feature discrepancy between modalities. Finally, the orthogonal bidirectional gated recurrent unit establishes multimodal feature dependencies along both the vertical and horizontal directions of the text. On the TextZoom test set, the proposed method outperforms mainstream approaches on the quantitative PSNR and SSIM metrics, and compared with the TPGSR model it raises the average recognition accuracy of the ASTER, MORAN, and CRNN recognizers by 2.9%, 3.6%, and 3.7%, respectively. These results show that multimodal semantic interaction effectively improves text image super-resolution and, in turn, text recognition accuracy.
Keywords: super-resolution reconstruction; text image; multi-granularity; semantic prior; multimodal
Classification: TP391 [Automation and Computer Technology / Computer Application Technology]
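
The abstract describes the multimodal semantic interaction block only at a high level. The PyTorch sketch below illustrates one plausible reading of its three components: a dual-stream visual block, cross-attention-based cross-modal fusion, and a bidirectional GRU applied along two orthogonal axes. All module designs, dimensions, and names here are illustrative assumptions for exposition, not the authors' released implementation.

# A minimal sketch, assuming PyTorch and invented module internals.
import torch
import torch.nn as nn


class VisualDualStreamBlock(nn.Module):
    """Combines a global stream (channel-wise statistics) with a local
    stream (3x3 convolution); the paper's exact streams are assumptions."""

    def __init__(self, channels: int):
        super().__init__()
        # Local stream: small-kernel convolution for local fitting.
        self.local = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Global stream: squeeze-and-excitation-style global statistics.
        self.global_fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.local(x) * self.global_fc(x) + x  # residual fusion


class CrossModalAdaptiveFusion(nn.Module):
    """Fuses semantic prior tokens with visual features via cross-attention;
    a learned gate adaptively weights the semantic contribution."""

    def __init__(self, channels: int, sem_dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, kdim=sem_dim,
                                          vdim=sem_dim, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, vis: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        # vis: (B, C, H, W) visual features; sem: (B, L, sem_dim) prior tokens.
        b, c, h, w = vis.shape
        q = vis.flatten(2).transpose(1, 2)       # (B, H*W, C)
        fused, _ = self.attn(q, sem, sem)        # cross-modal attention
        out = q + self.gate(fused) * fused       # gated residual fusion
        return out.transpose(1, 2).reshape(b, c, h, w)


class OrthogonalBiGRU(nn.Module):
    """Runs one bidirectional GRU along the horizontal axis and another
    along the vertical axis, matching the abstract's claim of text
    dependencies in both directions."""

    def __init__(self, channels: int):
        super().__init__()
        self.h_rnn = nn.GRU(channels, channels // 2, bidirectional=True,
                            batch_first=True)
        self.v_rnn = nn.GRU(channels, channels // 2, bidirectional=True,
                            batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Horizontal pass: every row becomes a sequence of length W.
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        rows, _ = self.h_rnn(rows)
        x = rows.reshape(b, h, w, c).permute(0, 3, 1, 2)
        # Vertical pass: every column becomes a sequence of length H.
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        cols, _ = self.v_rnn(cols)
        return cols.reshape(b, w, h, c).permute(0, 3, 2, 1)


class MultimodalSemanticInteractionBlock(nn.Module):
    """Dual stream -> cross-modal fusion -> orthogonal BiGRU, per the abstract."""

    def __init__(self, channels: int = 64, sem_dim: int = 512):
        super().__init__()
        self.dual = VisualDualStreamBlock(channels)
        self.fuse = CrossModalAdaptiveFusion(channels, sem_dim)
        self.rnn = OrthogonalBiGRU(channels)

    def forward(self, vis: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        return self.rnn(self.fuse(self.dual(vis), sem))


if __name__ == "__main__":
    block = MultimodalSemanticInteractionBlock()
    vis = torch.randn(2, 64, 16, 64)    # low-res text-image feature map
    sem = torch.randn(2, 26, 512)       # hypothetical semantic prior tokens
    print(block(vis, sem).shape)        # torch.Size([2, 64, 16, 64])

Running the block as above preserves the feature-map shape, so it can be stacked inside a standard super-resolution backbone; the orthogonal GRU is the piece that distinguishes this design from purely convolutional blocks, since text strokes correlate both along the reading direction and across character height.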