GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models (Cited by: 2)


Authors: Haicheng Liao, Huanming Shen, Zhenning Li, Chengyue Wang, Guofa Li, Yiming Bie, Chengzhong Xu

Affiliations: [1] State Key Laboratory of Internet of Things for Smart City and Department of Computer and Information Science, University of Macao, Macao SAR, 999078, China; [2] Department of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, 610000, China; [3] State Key Laboratory of Internet of Things for Smart City and Departments of Civil and Environmental Engineering and Computer and Information Science, University of Macao, Macao SAR, 999078, China; [4] State Key Laboratory of Internet of Things for Smart City and Department of Civil and Environmental Engineering, University of Macao, Macao SAR, 999078, China; [5] College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing, 400030, China; [6] School of Transportation, Jilin University, Changchun, 130000, China

Published in: Communications in Transportation Research, 2024, No. 1, pp. 5-23 (19 pages)

Funding: This research is supported by the Science and Technology Development Fund of Macao SAR (Nos. 0021/2022/ITP, 0081/2022/A2, 0015/2019/AKP, SKL-IoTSC(UM)-2021-2023/ORP/GA08/2022, and SKL-IoTSC(UM)-2024-2026/ORP/GA06/2023) and by the University of Macao (No. SRG2023-00037-IOTSC).

Abstract: In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This paper introduces an encoder-decoder framework developed to address visual grounding in AVs. Our Context-Aware Visual Grounding (CAVG) model integrates five core encoders (Text, Emotion, Image, Context, and Cross-Modal) with a multimodal decoder. This integration enables the CAVG model to capture contextual semantics and to learn human emotional features, augmented by state-of-the-art Large Language Models (LLMs) including GPT-4. The architecture of CAVG is reinforced by multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation. This design enables the model to efficiently process and interpret a range of cross-modal inputs, yielding a comprehensive understanding of the correlation between verbal commands and the corresponding visual scenes. Empirical evaluations on the Talk2Car dataset, a real-world benchmark, demonstrate that CAVG establishes new standards in prediction accuracy and operational efficiency. Notably, the model performs strongly even with limited training data, ranging from 50% to 75% of the full dataset, which highlights its effectiveness and potential for deployment in practical AV applications. Moreover, CAVG has shown remarkable robustness and adaptability in challenging scenarios, including long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather, and densely populated urban environments.
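Illustrative note: the abstract describes grounding verbal commands in visual scenes through multi-head cross-modal attention. The short PyTorch sketch below shows one plausible way text-token features could attend over candidate image regions to score them for grounding. It is an assumption-laden illustration (the module names, dimensions, and region-scoring head are hypothetical), not the authors' implementation.

# Minimal sketch of the cross-modal fusion idea described in the abstract.
# NOT the authors' released code: module names, dimensions, and the scoring
# head are illustrative assumptions only.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse command-text features with image-region features via
    multi-head cross-modal attention (regions attend to text tokens)."""
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Hypothetical stand-in for a region-relevance head:
        # maps each fused region feature to a single grounding logit.
        self.region_score = nn.Linear(d_model, 1)

    def forward(self, text_feats: torch.Tensor, region_feats: torch.Tensor):
        # text_feats:   (batch, n_tokens,  d_model) from a language encoder
        # region_feats: (batch, n_regions, d_model) from a visual encoder
        fused, attn_weights = self.cross_attn(
            query=region_feats, key=text_feats, value=text_feats
        )
        fused = self.norm(fused + region_feats)        # residual connection
        logits = self.region_score(fused).squeeze(-1)  # (batch, n_regions)
        return logits, attn_weights

if __name__ == "__main__":
    batch, n_tokens, n_regions, d = 2, 12, 16, 256
    model = CrossModalFusion(d_model=d)
    text = torch.randn(batch, n_tokens, d)
    regions = torch.randn(batch, n_regions, d)
    logits, _ = model(text, regions)
    print(logits.shape)  # torch.Size([2, 16]): one relevance score per candidate region

In the paper's terms, the RSD attention-modulation layer and the LLM-derived text and emotion features would sit around such a fusion step; the sketch only captures the basic cross-modal attention pattern named in the abstract.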

Keywords: Autonomous driving; Visual grounding; Cross-modal attention; Large language models; Human-machine interaction

Classification: TP3 [Automation and Computer Technology - Computer Science and Technology]

 
