Authors: 韩玉兰 (HAN Yulan); 罗轶宏 (LUO Yihong); 崔玉杰 (CUI Yujie); 兰朝凤 (LAN Chaofeng)
Affiliation: [1] College of Measurement and Control Technology and Communication Engineering, Harbin University of Science and Technology, Harbin 150080, Heilongjiang, China
Source: Optics and Precision Engineering, 2025, No. 1, pp. 135-147 (13 pages)
Funding: National Natural Science Foundation of China (No. 11804068); Natural Science Foundation of Heilongjiang Province (No. LH2020F033); Fundamental Research Funds for Heilongjiang Provincial Higher Education Institutions (No. 2020-KYYWF-0342).
Abstract: The accurate extraction of text content from images is hindered by the absence of scale transformation in feature representation and by insufficient resolution, which misleads the reconstruction network. To address this, this paper proposes a multimodal semantic interactive text image super-resolution reconstruction method. An attention mask within the semantic inference module corrects the text content information, and the resulting semantic prior constrains and guides the network toward reconstructing semantically accurate super-resolution text images. To enhance the network's representational capacity and accommodate text images of varying shapes and lengths, a multimodal semantic interaction block is introduced, built from three components: a visual dual-stream integration block, a cross-modal adaptive fusion block, and an orthogonal bidirectional gated recurrent unit. First, the visual dual-stream integration block captures multi-granularity visual information, including contextual understanding, by exploiting the complementary strengths of global statistical modelling and local fitting. Next, the cross-modal adaptive fusion block dynamically mediates the interaction between semantic information and multi-granularity visual features, reducing the feature discrepancy between modalities. Finally, the orthogonal bidirectional gated recurrent unit establishes multimodal feature dependencies along both the vertical and horizontal directions of the text. On the TextZoom test set, the proposed method outperforms mainstream approaches on the quantitative PSNR and SSIM metrics, and compared with the TPGSR model it raises the average recognition accuracy of the ASTER, MORAN, and CRNN recognizers by 2.9%, 3.6%, and 3.7%, respectively. These results show that multimodal semantic interaction effectively improves text image super-resolution and, in turn, text recognition accuracy.
Keywords: super-resolution reconstruction; text image; multi-granularity; semantic prior; multimodal
Classification: TP391 [Automation and Computer Technology / Computer Application Technology]
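
The abstract describes the multimodal semantic interaction block only at a high level. The PyTorch sketch below illustrates one plausible reading of its three components: a dual-stream visual block, cross-attention-based cross-modal fusion, and a bidirectional GRU applied along two orthogonal axes. All module designs, dimensions, and names here are illustrative assumptions for exposition, not the authors' released implementation.

# A minimal sketch, assuming PyTorch and invented module internals.
import torch
import torch.nn as nn


class VisualDualStreamBlock(nn.Module):
    """Combines a global stream (channel-wise statistics) with a local
    stream (3x3 convolution); the paper's exact streams are assumptions."""

    def __init__(self, channels: int):
        super().__init__()
        # Local stream: small-kernel convolution for local fitting.
        self.local = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Global stream: squeeze-and-excitation-style global statistics.
        self.global_fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.local(x) * self.global_fc(x) + x  # residual fusion


class CrossModalAdaptiveFusion(nn.Module):
    """Fuses semantic prior tokens with visual features via cross-attention;
    a learned gate adaptively weights the semantic contribution."""

    def __init__(self, channels: int, sem_dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, kdim=sem_dim,
                                          vdim=sem_dim, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, vis: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        # vis: (B, C, H, W) visual features; sem: (B, L, sem_dim) prior tokens.
        b, c, h, w = vis.shape
        q = vis.flatten(2).transpose(1, 2)       # (B, H*W, C)
        fused, _ = self.attn(q, sem, sem)        # cross-modal attention
        out = q + self.gate(fused) * fused       # gated residual fusion
        return out.transpose(1, 2).reshape(b, c, h, w)


class OrthogonalBiGRU(nn.Module):
    """Runs one bidirectional GRU along the horizontal axis and another
    along the vertical axis, matching the abstract's claim of text
    dependencies in both directions."""

    def __init__(self, channels: int):
        super().__init__()
        self.h_rnn = nn.GRU(channels, channels // 2, bidirectional=True,
                            batch_first=True)
        self.v_rnn = nn.GRU(channels, channels // 2, bidirectional=True,
                            batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Horizontal pass: every row becomes a sequence of length W.
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        rows, _ = self.h_rnn(rows)
        x = rows.reshape(b, h, w, c).permute(0, 3, 1, 2)
        # Vertical pass: every column becomes a sequence of length H.
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        cols, _ = self.v_rnn(cols)
        return cols.reshape(b, w, h, c).permute(0, 3, 2, 1)


class MultimodalSemanticInteractionBlock(nn.Module):
    """Dual stream -> cross-modal fusion -> orthogonal BiGRU, per the abstract."""

    def __init__(self, channels: int = 64, sem_dim: int = 512):
        super().__init__()
        self.dual = VisualDualStreamBlock(channels)
        self.fuse = CrossModalAdaptiveFusion(channels, sem_dim)
        self.rnn = OrthogonalBiGRU(channels)

    def forward(self, vis: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        return self.rnn(self.fuse(self.dual(vis), sem))


if __name__ == "__main__":
    block = MultimodalSemanticInteractionBlock()
    vis = torch.randn(2, 64, 16, 64)    # low-res text-image feature map
    sem = torch.randn(2, 26, 512)       # hypothetical semantic prior tokens
    print(block(vis, sem).shape)        # torch.Size([2, 64, 16, 64])

Running the block as above preserves the feature-map shape, so it can be stacked inside a standard super-resolution backbone; the orthogonal GRU is the piece that distinguishes this design from purely convolutional blocks, since text strokes correlate both along the reading direction and across character height.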