Dense Captioning Method Based on Multi-attention Structure  (Cited by: 1)


Authors: LIU Qing-Ru; LI Gang [1,2]; ZHAO Chuang; GU Guang-Hua; ZHAO Yao [3]

Affiliations: [1] School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004; [2] Hebei Provincial Key Laboratory of Information Transmission and Signal Processing, Qinhuangdao 066004; [3] Institute of Information Science, Beijing Jiaotong University, Beijing 100044

Source: Acta Automatica Sinica (《自动化学报》), 2022, Issue 10, pp. 2537-2548 (12 pages)

Funding: National Natural Science Foundation of China (62072394); Natural Science Foundation of Hebei Province (F2021203019); Hebei Provincial Key Laboratory Project (202250701010046).

Abstract: Dense captioning aims to provide detailed description sentences for complex scene images. Although existing methods have achieved good results, two problems remain: 1) most methods focus attention only on the deep semantic information extracted by the network and fail to exploit the geometric information contained in shallow visual features; 2) existing methods concentrate on improving the extraction of contextual information between regions of interest, but the spatial positions of objects within an image are still not well represented. To address these problems, this paper proposes a dense captioning method based on a multiple attention structure, MAS-ED (multiple attention structure encoder-decoder). MAS-ED integrates image features at multiple resolution scales through a multi-scale feature loop fusion (MFLF) mechanism, and adds a multi-branch spatial step attention (MSSA) module at the decoding end to capture the spatial relationships between objects in the image, enabling the model to generate more accurate dense captions. MAS-ED is evaluated on the Visual Genome dataset. The experimental results show that MAS-ED significantly improves the accuracy of dense captions and can adaptively incorporate geometric and spatial position information into the generated text. Built on a long short-term memory (LSTM) decoding framework, MAS-ED outperforms all baseline methods on the mainstream evaluation metrics.
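The paper itself does not publish code on this page; as a rough illustration of the multi-scale fusion idea the abstract describes (propagating deep, low-resolution semantic features back into shallow, high-resolution feature maps that carry geometric detail), the following is a minimal numpy sketch. All function names, the nearest-neighbor upsampling, the additive combination, and the loop count are assumptions for illustration, not the paper's actual MFLF design.

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def multi_scale_loop_fusion(features, n_loops=2):
    """Hypothetical sketch of loop fusion: repeatedly fold deeper
    (coarser, more semantic) features into shallower (finer, more
    geometric) ones, so every scale ends up carrying both kinds of
    information.

    `features` is ordered shallow -> deep; each map is (C, H, W)
    with H and W halving at each deeper level.
    """
    feats = [f.copy() for f in features]
    for _ in range(n_loops):
        # top-down pass: add each upsampled deeper map into the
        # next shallower map
        for i in range(len(feats) - 2, -1, -1):
            feats[i] = feats[i] + upsample2x(feats[i + 1])
    return feats

# toy example: three scales, channel dimension 4
features = [np.random.rand(4, 8, 8),
            np.random.rand(4, 4, 4),
            np.random.rand(4, 2, 2)]
fused = multi_scale_loop_fusion(features)
print([f.shape for f in fused])  # each scale keeps its original shape
```

A real implementation would use learned convolutions rather than a plain sum, but the sketch shows the data flow: resolution is preserved at every scale while semantics flow downward on each loop iteration.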
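The abstract also describes attending over regions of interest from the LSTM decoding end. As a generic illustration of one such attention step (not the paper's actual multi-branch MSSA module; the additive-attention form and all parameter names here are assumptions), a single step conditioned on the decoder hidden state might look like:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def spatial_attention_step(region_feats, hidden, W_r, W_h, v):
    """One hypothetical additive-attention step over region features.

    region_feats: (N, D) features for N regions of interest
    hidden:       (H,) current LSTM decoder hidden state
    W_r, W_h, v:  learned projections (random here, for illustration)
    """
    # score each region against the decoder state (additive attention)
    scores = np.tanh(region_feats @ W_r + hidden @ W_h) @ v  # (N,)
    weights = softmax(scores)                                # (N,)
    context = weights @ region_feats                         # (D,)
    return context, weights

rng = np.random.default_rng(0)
N, D, H, A = 5, 16, 32, 8
context, weights = spatial_attention_step(
    rng.standard_normal((N, D)), rng.standard_normal(H),
    rng.standard_normal((D, A)), rng.standard_normal((H, A)),
    rng.standard_normal(A))
# context has shape (D,); weights are non-negative and sum to 1
```

At each decoding step the resulting context vector would be fed to the LSTM alongside the previous word; the paper's MSSA presumably extends this with multiple branches encoding relative spatial positions of the regions.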

Keywords: dense image captioning; multi-attention structure; multi-scale feature loop fusion; multi-branch spatial step attention

Classification: TP391.41 (Automation and Computer Technology: Computer Application Technology)
