Authors: WANG Ping, ZHANG Xiao-Feng [2], WANG Yi-Huai [3], CHENG Ren-Gui
Affiliations: [1] School of Mathematics and Computer Science, Wuyi University, Wuyishan 354300, China; [2] School of Information Science and Technology, Nantong University, Nantong 226019, China; [3] School of Computer Science and Technology, Soochow University, Suzhou 215006, China; [4] Fujian Provincial Key Laboratory of Cognitive Computing and Intelligent Information Processing, Wuyishan 354300, China
Source: Computer Systems & Applications (《计算机系统应用》), 2019, No. 11, pp. 238-244 (7 pages)
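Funding: National Natural Science Foundation of China (61672369); Central Government Guided Local Science and Technology Development Special Project (2018L3013); General Program of the Natural Science Foundation of Fujian Province (2015J01669, 2017J01651); Young and Middle-aged Teachers Project of the Fujian Provincial Department of Education (JA15522)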
Abstract: Documents often contain horizontal rules, hand-drawn lines, and other lines that serve special purposes. When such documents are digitized by scanning or similar means and must then be recognized and converted into text encoding, these lines interfere with OCR and lower the recognition rate of the document content. This paper therefore proposes a new algorithm for removing interference lines from documents. The document image is first binarized, with the binarization step accounting for the effect of uneven illumination. The foreground is then thinned to single-pixel width to reduce the influence of line thickness. Next, an improved greedy algorithm computes weights for line segments in the horizontal and vertical directions, and segments with higher weights are judged to be interference lines. Finally, each foreground pixel in the image is assigned according to its distance from the interference lines, yielding a complete restored document image. Simulation experiments show that the proposed algorithm removes interference lines effectively, particularly where the lines touch the text: it removes the lines while affecting the quality of the document image relatively little, runs quickly, and gives good removal results, providing a sound basis for subsequent OCR of the image.
Classification number: TP3 [Automation and Computer Technology: Computer Science and Technology]
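
The abstract names the stages of the method but gives no implementation details, so the following is only a rough illustration of the general idea of removing a line while preserving text strokes that touch it. It is not the authors' algorithm: it substitutes a plain run-length test for the improved greedy weighting, handles only horizontal lines, and replaces the distance-based pixel assignment with a simple check for foreground pixels directly above and below. The function names `find_horizontal_lines` and `remove_lines` and the `min_len` threshold are hypothetical, and the input is assumed to be an already binarized image given as a list of rows of 0/1 values.

```python
def find_horizontal_lines(img, min_len):
    """Return (row, start_col, end_col) for each horizontal foreground
    run of at least min_len pixels. Assumption: long runs are lines."""
    lines = []
    for r, row in enumerate(img):
        c = 0
        while c < len(row):
            if row[c]:
                start = c
                while c < len(row) and row[c]:
                    c += 1
                if c - start >= min_len:
                    lines.append((r, start, c - 1))
            else:
                c += 1
    return lines


def remove_lines(img, min_len):
    """Erase pixels of long horizontal runs, keeping any pixel that has
    foreground both directly above and directly below it (a crude stand-in
    for deciding that the pixel belongs to a crossing text stroke)."""
    out = [row[:] for row in img]
    height = len(img)
    for r, start, end in find_horizontal_lines(img, min_len):
        for c in range(start, end + 1):
            above = r > 0 and img[r - 1][c]
            below = r + 1 < height and img[r + 1][c]
            if not (above and below):
                out[r][c] = 0
    return out


# A vertical stroke (column 3) crossed by a horizontal line (row 2).
image = [
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
]
cleaned = remove_lines(image, min_len=6)
```

In this toy input the horizontal run is erased while the vertical stroke stays intact. A line thicker than one pixel would defeat the neighbour check, which may be one reason the abstract describes thinning the foreground to single-pixel width before detecting lines.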