Authors: Zhang Yan (张言); Li Qiang (李强)[1,2]; Shen Huawen (申化文); Zeng Gangyan (曾港艳); Zhou Yu (周宇); Ma Can (马灿)[1,2]; Zhang Yuan (张远); Wang Weiping (王伟平)[1,2]
Affiliations: [1] Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; [2] School of Cyber Security, University of Chinese Academy of Sciences, Beijing 101408, China; [3] State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China
Source: Journal of Image and Graphics (《中国图象图形学报》), 2023, No. 8, pp. 2253-2275 (23 pages)
Funding: Chinese Academy of Sciences Basic Frontier Science Research Program, "From 0 to 1" Original Innovation Project (ZDBS-LY-7024).
Abstract: Text is ubiquitous in document images and natural scene images and carries rich, critical semantic information. With the development of deep learning, researchers are no longer satisfied with merely obtaining the text content of an image and pay increasing attention to understanding the text in images, so text-centric image understanding has attracted growing interest. This technology aims to fully understand text images by exploiting multimodal information such as text and visual objects. It is an interdisciplinary research direction spanning computer vision and natural language processing and is of great practical significance. This paper surveys representative text-centric image understanding tasks and divides them into two categories according to the level of understanding and cognition: the first requires only the ability to extract information, while the second additionally requires a degree of analysis and reasoning ability. The paper reviews the datasets, evaluation metrics, and classic methods involved in these tasks, compares and analyzes them, and identifies open problems in existing work and future development trends, in the hope of providing a reference for subsequent research.
Text is one of the key carriers of information transmission, and in digital media it appears widely in images such as documents and natural scenes. To process such images automatically, conventional research has focused mainly on text extraction techniques such as scene text detection and recognition. However, recognizing and analyzing the semantic information in text-centric images, a downstream task of text spotting, remains challenging because it is difficult to fully leverage multimodal features from both vision and language. To this end, text-centric image understanding has become an emerging research topic, and many related tasks have been proposed. For example, visual information extraction can extract specified content from a given image, which can improve productivity in finance, social media, and other fields. In this paper, we introduce five representative text-centric image understanding tasks and conduct a systematic survey of them. According to the level of understanding required, these tasks can be broadly classified into two categories. The first category requires the basic ability to extract and distinguish information, as in visual information extraction and scene text retrieval. The second category requires, in addition to this basic ability, high-level semantic understanding capabilities such as information aggregation and logical reasoning. With the progress of research in deep learning and multimodal learning, the second category has attracted considerable attention recently. For the second category, this survey mainly introduces document visual question answering, scene text visual question answering, and scene text image captioning. Over the past few decades, the development of text-centric image understanding techniques has gone through several stages. Earlier approaches are based on heuristic rules and may use only unimodal features. Currently, deep learning ...
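Keywords: text image understanding; visual information extraction; scene text image retrieval; document visual question answering; scene text visual question answering; scene text image captioning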
Classification number: TP391.4 [Automation and Computer Technology: Computer Application Technology]