An Image Captioning Method Based on a Dual Semantic Collaborative Network (DSC-Net)


Authors: Jiang Zetao [1], Zhu Wencai, Jin Xin, Liao Peiqi, Huang Jingfan

Affiliation: [1] Guangxi Key Laboratory of Image and Graphic Intelligent Processing (Guilin University of Electronic Technology), Guilin, Guangxi 541004, China

Source: Journal of Computer Research and Development, 2024, No. 11, pp. 3897-3908 (12 pages)

Funding: National Natural Science Foundation of China (62172118); Key Program of the Guangxi Natural Science Foundation (2021GXNSFDA196002); Guangxi Key Laboratory of Image and Graphic Intelligent Processing projects (GIIP2302, GIIP2303, GIIP2304); Guangxi Graduate Education Innovation Program (YCSW2022269); Guilin University of Electronic Technology Graduate Education Innovation Program (2023YCXS046).

Abstract: Grid features extracted by the CLIP (contrastive language-image pre-training) image encoder are visual features that lie closer to the text domain and are easy to convert into the corresponding natural-language semantics, which alleviates the semantic-gap problem; they may therefore become an important source of visual features for image captioning. However, this method does not take the image content into account when partitioning the image, so a complete object may be split across several grids. Segmenting objects in this way inevitably leaves the extracted features without a complete representation of the object, which in turn causes the generated sentences to lack an accurate description of the objects and the relationships between them. To address this property of CLIP grid features, we propose a dual semantic collaborative network (DSC-Net) for image captioning. Specifically, a dual semantic collaborative self-attention (DSCS) module is first proposed to strengthen the ability of CLIP grid features to express object information. A dual semantic collaborative cross-attention (DSCC) module is then proposed to combine grid-level and object-level semantics into text-related visual features used to predict the caption. Finally, a dual semantic fusion (DSF) module is proposed to supply region-dominated fused features to the two collaborative modules above, resolving the correlation conflicts that may arise during semantic collaboration. In extensive experiments on the COCO dataset, the proposed model achieves a CIDEr score of 138.5% on the offline test split of Karpathy et al. and a CIDEr score of 137.6% on the official online test, a clear advantage over current mainstream image captioning methods.
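The abstract only names the modules, but the idea behind DSCC (attending to grid-level and object-level semantics and combining the results) can be illustrated with a minimal numpy sketch. This is an assumed, simplified reading, not the authors' implementation: the shapes, the fixed mixing weight `alpha`, and the function names are all hypothetical, and the paper's actual modules are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv):
    """Scaled dot-product cross-attention: queries attend over kv features."""
    d = q.shape[-1]
    scores = q @ kv.T / np.sqrt(d)      # (n_q, n_kv) similarity scores
    return softmax(scores) @ kv         # (n_q, d) attended features

def dual_semantic_cross_attention(q, grid, regions, alpha=0.5):
    """Hypothetical DSCC-style step: attend to grid-level and object-level
    (region) features separately, then fuse the two attended results.
    alpha is an assumed fixed mixing weight; the paper learns this interaction."""
    a_grid = cross_attention(q, grid)       # grid-level semantics
    a_reg = cross_attention(q, regions)     # object-level semantics
    return alpha * a_grid + (1.0 - alpha) * a_reg

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 64))            # 5 caption-word queries, dim 64
grid = rng.standard_normal((49, 64))        # e.g. a 7x7 CLIP feature grid
regions = rng.standard_normal((10, 64))     # e.g. 10 detected object regions
out = dual_semantic_cross_attention(q, grid, regions)
print(out.shape)                            # (5, 64)
```

The point of the sketch is the dual pathway: each query sees both a grid view (which may cut objects apart) and a region view (which keeps each object whole), so the fused output can recover object information that the grid alone loses.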

Keywords: image captioning; grid features; attention mechanism; dual semantic collaborative attention; dual semantic collaborative feature fusion

CLC number: TP391.41 [Automation and Computer Technology: Computer Application Technology]

 
