Authors: Alaa Thobhani, Beiji Zou, Xiaoyan Kui, Amr Abdussalam, Muhammad Asim, Mohammed ELAffendi, Sajid Shah
Affiliations: [1] School of Computer Science and Engineering, Central South University, Changsha 410083, China; [2] Electronic Engineering and Information Science Department, University of Science and Technology of China, Hefei 230026, China; [3] EIAS Data Science Lab, College of Computer and Information Sciences, Prince Sultan University, Riyadh 11586, Saudi Arabia
Source: Computers, Materials & Continua, 2025, Issue 3, pp. 3943-3964 (22 pages)
Funding: Supported by the National Natural Science Foundation of China (Nos. U22A2034, 62177047); the High Caliber Foreign Experts Introduction Plan funded by MOST; and the Central South University Research Programme of Advanced Interdisciplinary Studies (No. 2023QYJC020).
Abstract: Image captioning, the task of generating descriptive sentences for images, has advanced significantly with the integration of semantic information. However, traditional models still rely on static visual features that do not evolve with the changing linguistic context, which can hinder the ability to form meaningful connections between the image and the generated captions. This limitation often leads to captions that are less accurate or descriptive. In this paper, we propose a novel approach to enhance image captioning by introducing dynamic interactions where visual features continuously adapt to the evolving linguistic context. Our model strengthens the alignment between visual and linguistic elements, resulting in more coherent and contextually appropriate captions. Specifically, we introduce two innovative modules: the Visual Weighting Module (VWM) and the Enhanced Features Attention Module (EFAM). The VWM adjusts visual features using partial attention, enabling dynamic reweighting of the visual inputs, while the EFAM further refines these features to improve their relevance to the generated caption. By continuously adjusting visual features in response to the linguistic context, our model bridges the gap between static visual features and dynamic language generation. We demonstrate the effectiveness of our approach through experiments on the MS-COCO dataset, where our method outperforms state-of-the-art techniques in terms of caption quality and contextual relevance. Our results show that dynamic visual-linguistic alignment significantly enhances image captioning performance.
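The abstract describes the VWM reweighting visual features against the evolving linguistic context, with the EFAM further refining the result at each step. The paper's internal design is not given in this record, so the sketch below is only a minimal PyTorch illustration of such per-step dynamic reweighting: the additive-attention layout, the gated refinement, and all dimensions are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of dynamic visual reweighting conditioned on the language
# state. Module names follow the abstract (VWM, EFAM); their internals here
# are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualWeightingModule(nn.Module):
    """Reweights region features using the current linguistic context."""
    def __init__(self, vis_dim: int, hid_dim: int, att_dim: int = 512):
        super().__init__()
        self.proj_v = nn.Linear(vis_dim, att_dim)  # project visual regions
        self.proj_h = nn.Linear(hid_dim, att_dim)  # project language state
        self.score = nn.Linear(att_dim, 1)         # scalar attention score

    def forward(self, regions: torch.Tensor, h_t: torch.Tensor) -> torch.Tensor:
        # regions: (B, R, vis_dim); h_t: (B, hid_dim)
        e = self.score(torch.tanh(self.proj_v(regions) +
                                  self.proj_h(h_t).unsqueeze(1)))  # (B, R, 1)
        alpha = F.softmax(e, dim=1)                # per-region weights
        return alpha * regions                     # reweighted visual features

class EnhancedFeaturesAttention(nn.Module):
    """Refines the reweighted features with a context-conditioned gate."""
    def __init__(self, vis_dim: int, hid_dim: int):
        super().__init__()
        self.gate = nn.Linear(vis_dim + hid_dim, vis_dim)

    def forward(self, weighted: torch.Tensor, h_t: torch.Tensor) -> torch.Tensor:
        h = h_t.unsqueeze(1).expand(-1, weighted.size(1), -1)
        g = torch.sigmoid(self.gate(torch.cat([weighted, h], dim=-1)))
        return g * weighted                        # gated, refined features

# Toy usage: 36 region features of size 2048, decoder hidden size 512.
if __name__ == "__main__":
    B, R, V, H = 2, 36, 2048, 512
    regions, h_t = torch.randn(B, R, V), torch.randn(B, H)
    vwm = VisualWeightingModule(V, H)
    efam = EnhancedFeaturesAttention(V, H)
    refined = efam(vwm(regions, h_t), h_t)  # re-run at every decoding step
    print(refined.shape)                    # torch.Size([2, 36, 2048])
```

Because both modules condition on the decoder state h_t, they are re-evaluated at every decoding step; that per-step dependence is what separates the dynamic alignment the abstract claims from a single static attention pass over the image.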
Keywords: image captioning; visual attention; deep learning; visual features
Classification: TP3 [Automation and Computer Technology - Computer Science and Technology]