Authors: HUANG Yong-tao; YAN Hua (School of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China)
Source: Computer Science, 2020, No. 6, pp. 133-137 (5 pages)
Funding: National Natural Science Foundation of China (61403265).
Abstract: Understanding a visual scene means not only identifying single objects in isolation but also capturing the interactions between different objects. A scene graph collects all (subject-predicate-object) tuples describing the object relationships within an image and is widely used in scene understanding tasks. However, most existing scene graph generation models have complex structures, slow inference, and low accuracy, so they cannot be used directly in practice. To address this, a scene graph generation model combining an attention mechanism and feature fusion, built on Factorizable Net, is proposed. First, an image is decomposed into subgraphs, each containing several objects and the relationships among them. Then, position and shape information is fused into the object features, and an attention mechanism is used to pass messages between object features and subgraph features. Finally, object classification and inter-object relationship inference are performed from the object features and subgraph features, respectively. Experimental results on multiple visual relationship detection datasets show that the model achieves an accuracy of 22.78% to 25.41% in visual relationship detection and 16.39% to 22.75% in scene graph generation, improvements of 1.2% and 1.8% over Factorizable Net, and that it can detect the objects in an image and their relationships within 0.6 s on a single GTX 1080Ti graphics card. These results demonstrate that the subgraph structure significantly reduces the number of image regions requiring relationship inference, while the feature fusion method and attention-based message passing strengthen the representational power of the deep features, so objects and their relationships can be predicted faster and more accurately. The model thus effectively solves the poor timeliness and low accuracy of traditional scene graph generation models.
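The two core steps of the abstract, fusing box position/shape into object features and attention-weighted message passing between object and subgraph features, can be illustrated with a minimal sketch. This is a simplified, hypothetical rendering of the idea (plain scaled dot-product attention over feature vectors, with a residual update), not the paper's actual architecture; all function names and the residual-fusion choice are assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def fuse_position_shape(appearance, box):
    """Concatenate normalized box geometry (x, y, width, height)
    onto an appearance feature vector, mirroring the abstract's
    position/shape feature-fusion step (illustrative only)."""
    x1, y1, x2, y2 = box
    return appearance + [x1, y1, x2 - x1, y2 - y1]

def attention_message_passing(obj_feats, sub_feats):
    """One round of attention-weighted message passing from subgraph
    features to object features (hypothetical sketch, not the
    paper's exact scheme)."""
    d = len(obj_feats[0])
    updated = []
    for o in obj_feats:
        # attention weights of this object over every subgraph feature
        w = softmax([sum(oi * si for oi, si in zip(o, s)) / math.sqrt(d)
                     for s in sub_feats])
        # message = attention-weighted sum of subgraph features
        msg = [sum(wi * s[k] for wi, s in zip(w, sub_feats))
               for k in range(d)]
        # residual fusion of the message into the object feature
        updated.append([oi + mi for oi, mi in zip(o, msg)])
    return updated
```

With a single subgraph the attention weight collapses to 1, so each object feature simply absorbs that subgraph feature; with several subgraphs, each object draws most strongly from the subgraphs it is most similar to.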
Keywords: scene graph; visual relationship detection; attention mechanism; message passing; feature fusion
Classification: TP391.4 [Automation and Computer Technology: Computer Application Technology]