Research progress in table recognition technology (cited by: 21)

A survey on table recognition technology

Authors: Gao Liangcai[1]; Li Yibo; Du Lin[2]; Zhang Xinpeng; Zhu Ziyi; Lu Ning[2]; Jin Lianwen[3]; Huang Yongshuai; Tang Zhi[1]

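Affiliations: [1] Wangxuan Computer Institute, Peking University, Beijing 100871, China; [2] Huawei AI Application Research Center, Huawei Technology Co., Ltd., Beijing 100085, China; [3] School of Electronics and Information Engineering, South China University of Technology, Guangzhou 510640, China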

Source: Journal of Image and Graphics (《中国图象图形学报》), 2022, No. 6, pp. 1898-1917 (20 pages)

Funding: Supported by the National Key Research and Development Program of China (2019YFB1406303).

Abstract: Tables are widespread in documents such as scientific literature, financial statements, newspapers, and magazines, where they store and present data compactly and carry a large amount of useful information. Table recognition is the foundation for reusing table information; it has significant application value and has long been a research focus in pattern recognition. With the development of deep learning, new studies and methods for table recognition have emerged rapidly. However, because tables appear in a wide range of scenarios, come in many styles, and vary greatly in image quality, many problems in the field remain to be solved. To summarize prior work and support subsequent research, this paper reviews the history and latest progress of the field in China and abroad, organized around three subtasks (table region detection, table structure recognition, and table content recognition) and covering both traditional and deep learning methods. It surveys the datasets and evaluation criteria for table recognition and, based on mainstream datasets and criteria, compares the performance of representative methods for table region detection, structure recognition, and table information extraction. It then compares the research progress and level of table recognition in China with those abroad. Finally, in light of the main difficulties and challenges currently facing the field, it offers an outlook on future research trends and technical goals.

Efficient access to data and the extraction of information from massive data have become essential technologies. Tables are an efficient structure for organizing, displaying, and analyzing data, and they are widely used on the Internet and in vertical domains because they are simple and intuitive. When tables are carried as images or portable document format (PDF) files, however, their structural information is lost and the original table is hard to recover, while manual re-entry is inefficient and error-prone. For the past two decades, research has therefore focused on the automatic recognition of tables in documents and PDF files and on its multiple subtasks. Table recognition aims to automatically detect tables in images, PDFs, and other electronic files, obtain their structure and content, and extract specific information. It comprises three tasks: table area detection, table structure recognition, and table content recognition. Existing table recognition methods commonly fall into two types. The first uses optical character recognition (OCR) to recognize the characters in the table directly and then analyzes their positions. The second uses digital image processing to obtain the key intersections and the position of each frame line of the table, and from these analyzes the relationships between cells. However, most of these methods apply only to a single domain and generalize poorly; they are also constrained by experience-based threshold design. With the development of deep learning, semantic segmentation, object detection, text sequence generation, pre-trained models, and related technologies have helped solve technical problems in table recognition. Most deep learning algorithms have been adapted to the characteristics of tables, which improves recognition performance. It uses
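The second type of method described in the abstract, which locates frame lines and their intersections and then derives cells from them, can be illustrated with a minimal sketch. This is not the implementation of any method covered by the survey. It assumes a clean binary image of a fully ruled table with no spanning cells, and the function names and the `min_fill` threshold are hypothetical, chosen only for illustration.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]  # 1 = ink pixel, 0 = background


def find_rulings(img: Grid, min_fill: float = 0.9) -> Tuple[List[int], List[int]]:
    """Return (row indices, column indices) that look like frame lines.

    A row or column counts as a frame line when at least `min_fill` of its
    pixels are ink. `min_fill` is a hypothetical, experience-based threshold.
    """
    height, width = len(img), len(img[0])
    rows = [y for y in range(height) if sum(img[y]) >= min_fill * width]
    cols = [x for x in range(width)
            if sum(img[y][x] for y in range(height)) >= min_fill * height]
    return rows, cols


def merge_adjacent(indices: List[int]) -> List[int]:
    """Collapse each run of consecutive indices (a thick line) to its first index."""
    merged: List[int] = []
    prev: Optional[int] = None
    for i in indices:
        if prev is None or i != prev + 1:
            merged.append(i)
        prev = i
    return merged


def intersections(rows: List[int], cols: List[int]) -> List[Tuple[int, int]]:
    """In a fully ruled grid, every (row, column) line pair crosses once."""
    return [(y, x) for y in rows for x in cols]


def cells_from_rulings(rows: List[int], cols: List[int]) -> List[Tuple[int, int, int, int]]:
    """Adjacent line pairs bound one cell each: (top, left, bottom, right)."""
    return [(top, left, bottom, right)
            for top, bottom in zip(rows, rows[1:])
            for left, right in zip(cols, cols[1:])]


def demo_image() -> Grid:
    """Build a 7x9 synthetic image of a 2x2 table with one-pixel frame lines."""
    height, width = 7, 9
    img = [[0] * width for _ in range(height)]
    for y in (0, 3, 6):
        for x in range(width):
            img[y][x] = 1
    for x in (0, 4, 8):
        for y in range(height):
            img[y][x] = 1
    return img


if __name__ == "__main__":
    rows, cols = find_rulings(demo_image())
    rows, cols = merge_adjacent(rows), merge_adjacent(cols)
    print("intersections:", intersections(rows, cols))
    print("cells:", cells_from_rulings(rows, cols))
```

The fixed `min_fill` threshold is the kind of experience-based design choice the abstract points to as a limitation: it works on this synthetic grid but would need retuning for skewed scans, broken lines, or tables without full rulings.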

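Keywords: table region detection; table structure recognition; table content recognition; deep learning; cell recognition; table information extraction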

Classification code: TP391.4 [Automation and Computer Technology: Computer Application Technology]
