Authors: WANG Xin (王鑫); SONG Yong-Hong (宋永红)[2]; ZHANG Yuan-Lin (张元林)[2]
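Affiliations: [1] School of Software Engineering, Xi'an Jiaotong University, Xi'an 710049; [2] College of Artificial Intelligence, Xi'an Jiaotong University, Xi'an 710049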
Source: Acta Automatica Sinica (《自动化学报》), 2022, No. 3, pp. 735-746 (12 pages)
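Funding: Supported by the Natural Science Basic Research Program of Shaanxi Province (2018JM6104) and the National Key Research and Development Project (2017YFB1301101).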
Abstract: Image captioning is a research direction that combines computer vision and natural language processing. This paper designs a novel salient feature extraction mechanism (SFEM) for image captioning. Before the language model predicts each word, SFEM quickly supplies it with the most valuable visual features to guide the prediction, which addresses two problems in existing methods: inaccurate selection of visual features and poor time performance. SFEM consists of two parts, a global salient feature extractor and an instant salient feature extractor. The global salient feature extractor extracts salient visual features from multiple local visual vectors and integrates them into a global salient visual vector. The instant salient feature extractor then extracts, from the global salient visual vector and according to the needs of the language model, the salient visual features required to predict each word. SFEM is evaluated on the MS COCO (Microsoft common objects in context) dataset. The experiments show that SFEM significantly improves the accuracy of the captions generated by the baseline model, and that it clearly outperforms the widely used spatial attention model in both caption accuracy and time performance.
Keywords: image captioning; salient feature extraction; language model; encoder; decoder
Classification: TP391.41 [Automation and Computer Technology: Computer Application Technology]
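Illustrative sketch: The abstract describes a two-stage data flow, in which a global extractor condenses many local visual vectors into one global vector and an instant extractor then selects from that vector at each decoding step. The Python sketch below only illustrates that flow. It is not the authors' implementation: the softmax pooling, the sigmoid gate, the parameter names (w_score, w_gate), and all shapes are assumptions chosen for the example, since the abstract does not specify the internals of either extractor.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def global_salient_vector(local_feats, w_score):
    """Pool k local visual vectors into one global vector.

    local_feats: array of shape (k, d), one row per image region.
    w_score:     array of shape (d,), a hypothetical scoring parameter.
    Assumption: softmax-weighted pooling stands in for the paper's
    global salient feature extractor.
    """
    scores = local_feats @ w_score        # one score per region, shape (k,)
    scores = scores - scores.max()        # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ local_feats          # weighted sum of rows, shape (d,)


def instant_salient_vector(global_vec, hidden, w_gate):
    """Select the features needed for the next word.

    global_vec: array of shape (d,), output of global_salient_vector.
    hidden:     array of shape (h,), current language-model state.
    w_gate:     array of shape (d, h), a hypothetical gating parameter.
    Assumption: an element-wise sigmoid gate stands in for the paper's
    instant salient feature extractor.
    """
    gate = sigmoid(w_gate @ hidden)       # values in (0, 1), shape (d,)
    return gate * global_vec


# Usage: the global vector is computed once per image, while the
# instant vector is recomputed at every decoding step.
rng = np.random.default_rng(0)
local_feats = rng.normal(size=(36, 8))    # 36 regions, 8-dim features (made up)
g = global_salient_vector(local_feats, rng.normal(size=8))
for step in range(3):
    hidden = rng.normal(size=16)          # placeholder decoder state
    v = instant_salient_vector(g, hidden, rng.normal(size=(8, 16)))
```

In this sketch the per-step work touches only the single d-dimensional global vector rather than all k regions. Whether that is the source of the speed advantage reported in the abstract is not stated there, so treat it as one possible reading.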