Research on a Multi-modal Model Based on Cross-modal Multi-dimensional Relationship Enhancement

Authors: Cheng Xi; Yang Guan [1,2]; Liu Xiaoming [1,2]; Liu Yang (School of Computer Science, Zhongyuan University of Technology, Zhengzhou 450007, China; Henan Key Laboratory on Public Opinion Intelligent Analysis, Zhongyuan University of Technology, Zhengzhou 450007, China; School of Telecommunications Engineering, Xidian University, Xi'an 710071, China)

Affiliations: [1] School of Computer Science, Zhongyuan University of Technology, Zhengzhou 450007; [2] Henan Key Laboratory of Network Public Opinion Monitoring and Intelligent Analysis, Zhongyuan University of Technology, Zhengzhou 450007; [3] School of Telecommunications Engineering, Xidian University, Xi'an 710071

Source: Application Research of Computers, 2023, No. 8, pp. 2367-2374 (8 pages)

Funding: Young Scientists Fund of the National Natural Science Foundation of China (61906141); Key Scientific Research Project of Colleges and Universities in Henan Province (23A520022); Key Laboratory of Applied Statistics of the Ministry of Education, Northeast Normal University (135131007).

Abstract: Current multi-modal models cannot fully mine the spatial relationships among non-salient regions of an image or the semantic relationships within the language context, which leads to poor multi-modal relational reasoning. To address this problem, this paper proposes a multi-modal model based on cross-modal multi-dimensional relationship enhancement (multi-dimensional relationship enhancement model, MRE), which extracts spatial-relationship information among image elements under the latent structure and infers the semantic correlation between vision and language. First, a feature diversity module mines the sub-salient region features associated with the salient regions of the image, thereby enriching the spatial-relationship feature representation. Second, a context-guided attention module guides the model to learn the relationship between the language context and the image, achieving cross-modal relationship alignment. Experiments on the MSCOCO dataset show that the proposed model achieves better performance, with BLEU-4 and CIDEr scores improved by 0.5% and 1.3%, respectively. Applied to the visual question answering task, the method improves performance by 0.62% on the VQA 2.0 dataset, demonstrating its broad applicability to multi-modal tasks.
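The context-guided attention described in the abstract can be pictured as cross-modal attention in which the language context supplies the queries and the image region features supply the keys and values, so each word is aligned to the regions it most relates to. The following is a minimal illustrative sketch, not the paper's actual implementation: the projection matrices `Wq`, `Wk`, `Wv` and all dimensions are hypothetical placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def context_guided_attention(text_ctx, regions, Wq, Wk, Wv):
    """Cross-modal attention: language context queries image regions.

    text_ctx: (T, d_t) word/context embeddings
    regions:  (R, d_v) image region features
    Returns:  (T, d)  visually grounded context features
    """
    Q = text_ctx @ Wq                        # (T, d) queries from language
    K = regions @ Wk                         # (R, d) keys from image regions
    V = regions @ Wv                         # (R, d) values from image regions
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (T, R) scaled dot-product scores
    attn = softmax(scores, axis=-1)          # each word attends over all regions
    return attn @ V                          # aggregate region features per word

# Toy dimensions (hypothetical): 5 words, 36 regions, 512-d joint space.
rng = np.random.default_rng(0)
T, R, d_t, d_v, d = 5, 36, 300, 2048, 512
text_ctx = rng.standard_normal((T, d_t))
regions = rng.standard_normal((R, d_v))
Wq = rng.standard_normal((d_t, d)) * 0.01
Wk = rng.standard_normal((d_v, d)) * 0.01
Wv = rng.standard_normal((d_v, d)) * 0.01
out = context_guided_attention(text_ctx, regions, Wq, Wk, Wv)
assert out.shape == (T, d)
```

In a full model this output would be fused with the enhanced spatial features before decoding; the sketch only shows the alignment step itself.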

Keywords: image captioning; visual question answering; feature diversity; spatial relationship; contextual semantic relationship; feature fusion; multi-modal encoding

Classification: TP183 [Automation and Computer Technology / Control Theory and Control Engineering]

 
