Learning a Mixture of Conditional Gating Blocks for Visual Question Answering  


Authors: Qiang Sun, Yan-Wei Fu, Xiang-Yang Xue (孙强, 付彦伟, 薛向阳)

Affiliations: [1] School of Statistics and Information, Shanghai University of International Business and Economics, Shanghai 201620, China; [2] Academy for Engineering and Technology, Fudan University, Shanghai 200433, China; [3] School of Data Science, Fudan University, Shanghai 200433, China; [4] School of Computer Science, Fudan University, Shanghai 200433, China

Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), 2024, Issue 4, pp. 912-928 (17 pages)

Funding: Supported in part by the National Natural Science Foundation of China under Grant No. 62176061 and the Science and Technology Commission of Shanghai Municipality under Grant No. 22511105000.

Abstract: As a Turing test in multimedia, visual question answering (VQA) aims to answer a textual question about a given image. Recently, the "dynamic" property of neural networks has been explored as one of the most promising ways of improving the adaptability, interpretability, and capacity of neural network models. Unfortunately, despite the prevalence of dynamic convolutional neural networks, exploiting dynamics in the transformers of VQA tasks through all stages in an end-to-end manner remains relatively unexplored and highly nontrivial. Typically, because of the large computation cost of transformers, researchers tend to apply transformers only to extracted high-level visual features for downstream vision and language tasks. To this end, we introduce a question-guided dynamic layer to the transformer, as it can effectively increase the model capacity and require fewer transformer layers for the VQA task. In particular, we name this dynamic layer in the Transformer the Conditional Multi-Head Self-Attention block (cMHSA). Furthermore, our question-guided cMHSA is compatible with the conditional ResNeXt block (cResNeXt). Thus, a novel model, Mixture of Conditional Gating Blocks (McG), is proposed for VQA, which keeps the best of the Transformer, the convolutional neural network (CNN), and dynamic networks. The pure conditional gating CNN model and the conditional gating Transformer model can be viewed as special cases of McG. We quantitatively and qualitatively evaluate McG on the CLEVR and VQA-Abstract datasets. Extensive experiments show that McG achieves state-of-the-art performance on these benchmark datasets.
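The abstract does not say how cMHSA computes or applies its question-guided gates, so the following is only an illustrative sketch of the general idea of question-conditioned gating in self-attention, not the authors' method. It assumes one sigmoid gate per attention head, computed from a question embedding by a linear map, and it omits learned query/key/value projections for brevity. The names `question_gates`, `gated_self_attention`, and `W_gate` are hypothetical and do not come from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def question_gates(question_vec, W_gate):
    """Assumed gate form: one value in (0, 1) per head, from the question embedding."""
    return [sigmoid(sum(w * q for w, q in zip(row, question_vec))) for row in W_gate]

def gated_self_attention(tokens, question_vec, W_gate, num_heads):
    """Self-attention over visual tokens; each head's output is scaled by its question gate.

    Simplification: queries, keys, and values are the raw per-head token slices.
    """
    dim = len(tokens[0])
    head_dim = dim // num_heads
    gates = question_gates(question_vec, W_gate)
    out = [[0.0] * dim for _ in tokens]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        heads = [t[lo:hi] for t in tokens]
        for i, q in enumerate(heads):
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim) for k in heads]
            attn = softmax(scores)
            for d in range(head_dim):
                out[i][lo + d] = gates[h] * sum(w * v[d] for w, v in zip(attn, heads))
    return out

# Example: 3 visual tokens of width 4, 2 heads, a 2-dimensional question embedding.
tokens = [[0.1, -0.3, 0.7, 0.2], [0.5, 0.4, -0.6, 0.0], [-0.2, 0.9, 0.3, -0.8]]
question = [0.5, -0.2]
W_gate = [[1.0, 0.0], [0.0, 1.0]]  # one row per head
gated = gated_self_attention(tokens, question, W_gate, num_heads=2)
```

In this sketch a gate near 0 suppresses a head for the current question and a gate near 1 leaves it unchanged, which is one simple way a question could modulate a transformer layer.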

Keywords: visual question answering; Transformer; dynamic network

Classification number: TP391.41 [Automation and Computer Technology: Computer Application Technology]

 
