Authors: SIMA Shuang-Lin (司马双霖), HUANG Yan (黄岩)[1,3], HE Ke-Ji (何科技), AN Dong (安东), YUAN Hui (袁辉), WANG Liang (王亮)[1,2,3,4,5]
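Affiliations: [1] Center of Research on Intelligent Perception and Computing, Institute of Automation, Chinese Academy of Sciences, Beijing 100190; [2] School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049; [3] National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190; [4] Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Shanghai 200031; [5] Artificial Intelligence Research, Chinese Academy of Sciences, Jiaozhou 266300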
Source: Acta Automatica Sinica (《自动化学报》), 2023, No. 1, pp. 1-14 (14 pages)
Abstract: In vision-and-language navigation, an agent placed in an unknown environment starts from an initial location, analyzes a language instruction together with its surrounding visual environment, dynamically generates a sequence of actions in response, and finally navigates to the goal location. The task has broad application prospects and has received wide attention in multi-modal research in recent years. Unlike traditional multi-modal tasks such as visual question answering and image captioning, vision-and-language navigation is more challenging in terms of multi-modal fusion and reasoning. However, owing to the shortcomings of traditional imitation learning and the scarcity of data, models suffer from insufficient generalization. This paper systematically reviews research progress in vision-and-language navigation. It first briefly introduces the datasets and basic models for the task. It then comprehensively introduces representative models and methods, covering four aspects: data augmentation, search strategies, training methods, and action spaces. Finally, based on experiments on different datasets, it analyzes and compares the strengths and weaknesses of existing models and discusses possible future research directions.
Keywords: vision-and-language navigation; vision-language understanding; cross-modal matching; embodied intelligence
Classification number: TP391.41 [Automation and Computer Technology: Computer Application Technology]