Authors: Qiang Sun (孙强), Yan-Wei Fu (付彦伟), Xiang-Yang Xue (薛向阳)
Affiliations: [1] School of Statistics and Information, Shanghai University of International Business and Economics, Shanghai 201620, China; [2] Academy for Engineering and Technology, Fudan University, Shanghai 200433, China; [3] School of Data Science, Fudan University, Shanghai 200433, China; [4] School of Computer Science, Fudan University, Shanghai 200433, China
Source: Journal of Computer Science & Technology (计算机科学技术学报, English edition), 2024, No. 4, pp. 912-928 (17 pages)
Funding: Supported in part by the National Natural Science Foundation of China under Grant No. 62176061, and the Science and Technology Commission of Shanghai Municipality under Grant No. 22511105000.
Abstract: As a Turing test in multimedia, visual question answering (VQA) aims to answer a textual question about a given image. Recently, the "dynamic" property of neural networks has been explored as one of the most promising ways of improving the adaptability, interpretability, and capacity of neural network models. Unfortunately, despite the prevalence of dynamic convolutional neural networks, it remains relatively unexplored and highly nontrivial to exploit dynamics in the Transformers of VQA models through all stages in an end-to-end manner. Typically, due to the large computational cost of Transformers, researchers are inclined to apply Transformers only to the extracted high-level visual features for downstream vision-and-language tasks. To this end, we introduce a question-guided dynamic layer into the Transformer, as it can effectively increase the model capacity and requires fewer Transformer layers for the VQA task. In particular, we name the dynamics in the Transformer the Conditional Multi-Head Self-Attention block (cMHSA). Furthermore, our question-guided cMHSA is compatible with the conditional ResNeXt block (cResNeXt). Thus, a novel mixture of conditional gating blocks (McG) is proposed for VQA, which keeps the best of the Transformer, the convolutional neural network (CNN), and dynamic networks. The pure conditional gating CNN model and the conditional gating Transformer model can be viewed as special cases of McG. We quantitatively and qualitatively evaluate McG on the CLEVR and VQA-Abstract datasets. Extensive experiments show that McG achieves state-of-the-art performance on these benchmark datasets.
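To make the question-guided dynamics described in the abstract concrete, the following PyTorch code is a minimal sketch of a question-conditioned multi-head self-attention block. It is not the authors' released implementation: the per-head sigmoid gates predicted from the question embedding, and names such as ConditionalMHSA, dim, and q_dim, are assumptions made for illustration only.

# Minimal sketch (not the paper's official code) of a question-conditioned
# multi-head self-attention block. A pooled question embedding predicts one
# sigmoid gate per attention head, so the question modulates which heads
# contribute to the visual representation.
import torch
import torch.nn as nn


class ConditionalMHSA(nn.Module):
    """Multi-head self-attention whose heads are gated by a question vector."""

    def __init__(self, dim: int, num_heads: int, q_dim: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Predict one gate per attention head from the question embedding.
        self.gate = nn.Sequential(nn.Linear(q_dim, num_heads), nn.Sigmoid())

    def forward(self, x: torch.Tensor, q_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) visual tokens; q_emb: (batch, q_dim) question.
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = attn.softmax(dim=-1) @ v                 # (b, heads, n, head_dim)
        gates = self.gate(q_emb).view(b, self.num_heads, 1, 1)
        out = out * gates                              # question-dependent head gating
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


if __name__ == "__main__":
    block = ConditionalMHSA(dim=256, num_heads=8, q_dim=512)
    tokens = torch.randn(2, 49, 256)      # e.g., a 7x7 grid of visual features
    question = torch.randn(2, 512)        # pooled question embedding
    print(block(tokens, question).shape)  # torch.Size([2, 49, 256])

The same gating idea extends naturally to a conditional convolutional block (the cResNeXt mentioned in the abstract) by letting the question embedding gate convolutional groups instead of attention heads; the paper's McG model mixes both kinds of conditional gating blocks.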
Keywords: visual question answering; Transformer; dynamic network
Classification: TP391.41 [Automation and Computer Technology - Computer Application Technology]