Authors: YANG Fan; WANG Zhishe [1]; SUN Jing; YU Zhaofa (School of Applied Science, Taiyuan University of Science and Technology, Taiyuan 030024, China; Ordnance NCO Academy, Army Engineering University of PLA, Wuhan 430075, China)
Affiliations: [1] School of Applied Science, Taiyuan University of Science and Technology, Taiyuan 030024, China; [2] Ordnance NCO Academy, Army Engineering University of PLA, Wuhan 430075, China
Source: Acta Photonica Sinica (《光子学报》), 2024, No. 6, pp. 214-225 (12 pages)
Fund: Shanxi Province Basic Research Program (No. 202203021221144).
Abstract: The fusion of infrared and visible images aims to merge their complementary information into a single output with better visual perception and scene understanding. Existing CNN-based methods typically employ convolutional operations to extract local features but fail to model long-range relationships; conversely, Transformer-based methods use self-attention to model global dependencies but lack the complement of local information. More importantly, both families often ignore interactive information learning across modalities, which limits fusion performance. To address these issues, this paper introduces an infrared and visible image fusion method via interactive self-attention, named ISAFusion. First, we devise a collaborative learning scheme that seamlessly integrates CNN and Transformer: residual convolutional blocks extract local features, which are then fed into the Transformer to model global dependencies, strengthening the network's feature representation ability. Second, we construct a cross-modality interactive attention module, a cascade of Token-ViT and Channel-ViT, which models long-range dependencies along the token and channel dimensions in an interactive manner and allows feature communication between spatial locations and independent channels. The resulting global features focus on the intrinsic characteristics of the different modality images and effectively strengthen their complementary information for better fusion performance. Finally, we train the fusion network end-to-end with a comprehensive objective function comprising a structural similarity index measure (SSIM) loss, a gradient loss, and an intensity loss, ensuring that the fusion model preserves similar structural information, valuable pixel intensities, and rich texture details from the source images. To verify the effectiveness and superiority of the proposed method, we carry out subjective and objective experiments on three different datasets, namely TNO, M3FD, and Roadscene. The results show that, compared with seven state-of-the-art fusion methods, the proposed method offers clear advantages in fusion performance, model generalization, and computational efficiency.
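The abstract names two concrete components: a cascaded token-wise/channel-wise attention module (Token-ViT followed by Channel-ViT) and a composite SSIM + gradient + intensity loss. The PyTorch sketch below is a minimal illustration of both ideas under stated assumptions, not the authors' ISAFusion implementation: it assumes single-channel inputs in [0, 1], a uniform-window SSIM, max-based fusion targets for the gradient and intensity terms, and illustrative loss weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenChannelAttention(nn.Module):
    """Cascade of token-wise then channel-wise self-attention over a feature
    map, a stand-in for the Token-ViT -> Channel-ViT cascade the abstract
    describes. Channel count and head count are illustrative assumptions."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.token_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        # Token-wise attention: long-range dependencies across spatial positions.
        t = self.norm1(tokens)
        tokens = tokens + self.token_attn(t, t, t, need_weights=False)[0]
        # Channel-wise attention: transpose so each channel acts as a token,
        # letting independent channels exchange information.
        ch = self.norm2(tokens).transpose(1, 2)          # (B, C, H*W)
        attn = torch.softmax(ch @ ch.transpose(1, 2) / (h * w) ** 0.5, dim=-1)
        tokens = tokens + (attn @ ch).transpose(1, 2)    # back to (B, H*W, C)
        return tokens.transpose(1, 2).reshape(b, c, h, w)


def ssim(x, y, win: int = 11, c1: float = 0.01 ** 2, c2: float = 0.03 ** 2):
    # Single-scale SSIM with a uniform window (simplified vs. Gaussian SSIM).
    pad = win // 2
    mu_x, mu_y = F.avg_pool2d(x, win, 1, pad), F.avg_pool2d(y, win, 1, pad)
    var_x = F.avg_pool2d(x * x, win, 1, pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, win, 1, pad) - mu_y ** 2
    cov = F.avg_pool2d(x * y, win, 1, pad) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).mean()


def sobel_grad(x):
    # |gx| + |gy| with fixed Sobel kernels; assumes single-channel input.
    kx = torch.tensor([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]],
                      device=x.device).view(1, 1, 3, 3)
    return (F.conv2d(x, kx, padding=1).abs()
            + F.conv2d(x, kx.transpose(2, 3), padding=1).abs())


def fusion_loss(fused, ir, vis, w_ssim=1.0, w_grad=10.0, w_int=1.0):
    # Composite objective named in the abstract: SSIM + gradient + intensity.
    # The weights and the max-based targets are assumptions for illustration.
    l_ssim = (1 - ssim(fused, ir)) + (1 - ssim(fused, vis))
    l_grad = F.l1_loss(sobel_grad(fused),
                       torch.max(sobel_grad(ir), sobel_grad(vis)))
    l_int = F.l1_loss(fused, torch.max(ir, vis))
    return w_ssim * l_ssim + w_grad * l_grad + w_int * l_int


# Smoke test with random single-channel images and a random feature map.
ir, vis = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
feat = torch.rand(1, 32, 16, 16)
print(TokenChannelAttention(32)(feat).shape)   # torch.Size([1, 32, 16, 16])
print(fusion_loss(torch.rand(1, 1, 64, 64), ir, vis))
```

In this sketch the token and channel attention share one residual stream, so spatial and channel mixing compound rather than run in parallel; the paper's actual cross-modality interaction between the two source-image branches is not reproduced here.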
Keywords: image fusion; self-attention mechanism; feature interaction; deep learning; multi-modality images
CLC Number: TP391.4 [Automation and Computer Technology - Computer Application Technology]