Adaptive recognition of complex invoices based on Tesseract-OCR  Cited by: 7


Authors: 孙瑞彬 (SUN Ruibin); 钱夔 (QIAN Kui); 徐伟敏 (XU Weimin); 路红 (LU Hong) [1]

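Affiliations: [1] School of Automation, Nanjing Institute of Technology, Nanjing 211167; [2] Nanjing Xuefu Ruijie Information Technology Company Limited, Nanjing 210009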

Source: Journal of Nanjing University of Information Science & Technology (Natural Science Edition), 2021, No. 3, pp. 349-354 (6 pages)

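Funding: Nanjing Institute of Technology Research Start-up Fund for Introduced Talents (YKJ201918); Nanjing Institute of Technology University-level Research Fund (CXY201930).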

Abstract: To address the extraction and real-time recognition of specific table content located in arbitrary regions of complex invoices, an adaptive recognition method based on the Tesseract-OCR engine is proposed. First, the invoice image is preprocessed with OpenCV, including filtering and adaptive thresholding, to obtain a binary image. Next, the morphological opening operation is used to extract all line segments of the table and locate the table; the coordinates of the table intersection points are then combined with a custom template to adaptively match table headers to their content. Finally, jTessBoxEditor is used to train and optimize the character library on the content of the table regions, and character recognition is performed with Tesseract-OCR. Experimental results show that the method achieves high recognition accuracy, supports adaptive recognition of regions of interest (ROI), and offers high availability.

Keywords: invoice recognition; Tesseract-OCR; OpenCV; character library training; adaptive recognition

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
