Frontier research and latest trends in 3D visual-language reasoning techniques

Comprehensive survey on 3D visual-language understanding techniques

Authors: Lei Yinjie (雷印杰) [1]; Xu Kai (徐凯) [2]; Guo Yulan (郭裕兰) [3]; Yang Xin (杨鑫) [4]; Wu Yuwei (武玉伟) [5]; Hu Wei (胡玮) [6]; Yang Jiaqi (杨佳琪) [7]; Wang Hanyun (汪汉云) [8]

Affiliations: [1] College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China [2] School of Computer Science, National University of Defense Technology, Changsha 410073, China [3] College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China [4] School of Computer Science and Technology, Dalian University of Technology, Dalian 116081, China [5] School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China [6] Wangxuan Institute of Computer Technology, Peking University, Beijing 100091, China [7] School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China [8] College of Computer and Data Science/College of Software, Information Engineering University, Zhengzhou 450001, China

Source: Journal of Image and Graphics (《中国图象图形学报》), 2024, No. 6, pp. 1747-1764 (18 pages)

Funding: National Natural Science Foundation of China (U23B2013, 62276176).

Abstract: The core idea of 3D visual reasoning is to understand the relationships among visual entities in point cloud scenes. Nonprofessional users find it difficult to convey their intentions to a computer, which limits the popularization of this technology. To address this, researchers use natural language as the semantic context and query condition that reflects user intent, and let it interact with point cloud information to complete the corresponding task. This paradigm, called "3D visual-language" reasoning, is widely applied in autonomous driving, robot navigation, human-computer interaction, and many other fields, and has become a high-profile research direction in computer vision. Over the past few years, 3D visual-language reasoning techniques have developed rapidly and diversified, yet a comprehensive summary of the latest research progress is still lacking. This paper focuses on the two most representative lines of work, bounding-box prediction and content generation, and systematically summarizes the latest progress in the field. First, it summarizes the problem definition and existing challenges of 3D visual-language reasoning and outlines some common backbone networks. Second, it further subdivides the two classes of techniques according to the downstream scenarios they target and discusses the strengths and weaknesses of each method in depth. Next, it compares and analyzes the performance of the various methods on different benchmark datasets. Finally, it looks ahead to future developments of 3D visual-language reasoning techniques, with the aim of promoting further research and wide application in this field.

The core of 3D visual reasoning is to understand the relationships among different visual entities in point cloud scenes. Traditional 3D visual reasoning typically requires users to possess professional expertise. However, nonprofessional users face difficulty conveying their intentions to computers, which hinders the popularization and advancement of this technology. Users now anticipate a more convenient way to convey their intentions to the computer in order to exchange information and obtain personalized results. To address this issue, researchers use natural language as a semantic background or query criterion that reflects user intentions, and accomplish various tasks by having this natural language interact with 3D point clouds. Through multimodal interaction, often employing techniques such as the Transformer or graph neural networks, current approaches can not only locate the entities mentioned by users (e.g., visual grounding and open-vocabulary recognition) but also generate user-required content (e.g., dense captioning, visual question answering, and scene generation). Specifically, 3D visual grounding aims to locate desired objects or regions in a 3D point cloud scene based on an object-related linguistic query. Open-vocabulary 3D recognition aims to identify and localize 3D objects of novel classes defined by an unbounded (open) vocabulary at inference, generalizing beyond the limited number of base classes labeled during training. 3D dense captioning aims to identify all possible instances within a 3D point cloud scene and generate a natural language description for each instance. The goal of 3D visual question answering is to comprehend an entire 3D scene and provide an appropriate answer. Text-guided scene generation aims to synthesize a realistic 3D scene, composed of a complex background and multiple objects, from natural language descriptions. The aforementioned paradigm, which is known as 3D visual-language understanding, has gained significant traction

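Keywords: deep learning; computer vision; 3D visual-language reasoning; cross-modal learning; visual grounding; dense captioning; visual question answering; scene generation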

Classification: TP399 [Automation and Computer Technology - Computer Application Technology]
