Text Line Segmentation of Tibetan Historical Documents Based on Text Core Regions Combined with Expansion Growth

Cited by: 4

Authors: Li Jincheng; Wang Xiaojuan; Wang Weilan [1]; Lin Qiang; Hu Pengfei

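Affiliations: [1] Key Laboratory of China's Ethnic Languages and Information Technology of Ministry of Education, Northwest Minzu University, Lanzhou, Gansu 730030, China; [2] College of Mathematics and Computer Science, Northwest Minzu University, Lanzhou, Gansu 730030, China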

Source: Laser & Optoelectronics Progress, 2021, No. 2, pp. 105-115 (11 pages)

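Funding: National Natural Science Foundation of China (61772430); Innovation Team Program of the State Ethnic Affairs Commission ([2018] No. 98); Gansu Province "Double First-Class" Discipline Construction Project (11080304); Gansu Province Higher Education Innovation Capability Improvement Project (2019B-024); Fundamental Research Funds for the Central Universities, Northwest Minzu University (31920180050).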

Abstract: In Tibetan historical document images, adjacent text lines often touch and overlap, which makes text line segmentation a difficult task. We propose a line segmentation method for Tibetan historical document images that combines text core regions with expansion growth. First, connected components that are not syllable points are removed from the binarized document image according to their area and roundness, yielding a syllable-point image. Second, horizontal projection of the syllable-point image and vertical projection of the original binary image give the range in which the text line baselines lie and the number of text lines, from which the text core regions are generated. The text core regions are then combined with the original binary image by a pixel-wise OR operation to obtain the pseudo-text connected regions. Finally, a breadth-first search expands the text core regions into the pseudo-text connected regions, giving the pseudo-text-line connected regions. The non-text regions are removed from these to obtain the pseudo-text lines, and a line-attribution step for broken strokes produces the final text lines. Experimental results show that the proposed method achieves relatively good segmentation and effectively handles overlapping between text lines, partial touching between lines, and broken strokes in Tibetan historical documents.
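The expansion-growth step described in the abstract can be illustrated with a minimal sketch. It assumes a multi-source breadth-first search with 4-connectivity over a binary image stored as nested lists; the function name `grow_core_regions`, the label encoding, and the tie-breaking rule (a shared pixel goes to whichever seed reaches it first) are illustrative assumptions, not details taken from the paper.

```python
from collections import deque


def grow_core_regions(binary, seeds):
    """Spread seed labels over connected foreground pixels by multi-source BFS.

    binary: list of rows, 1 = ink (foreground), 0 = background.
    seeds:  same shape, 0 = unlabeled, k > 0 = core region of text line k
            (hypothetical encoding chosen for this sketch).
    Returns a label map in which every foreground pixel reachable from a
    seed carries the label of the seed that reached it first.
    """
    height, width = len(binary), len(binary[0])
    labels = [row[:] for row in seeds]  # copy so the input is not modified
    queue = deque(
        (r, c) for r in range(height) for c in range(width) if seeds[r][c] > 0
    )
    while queue:
        r, c = queue.popleft()
        # Visit the 4-connected neighbours of the current pixel.
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < height and 0 <= nc < width:
                if binary[nr][nc] == 1 and labels[nr][nc] == 0:
                    labels[nr][nc] = labels[r][c]
                    queue.append((nr, nc))
    return labels


# Toy page: two core regions (rows 1 and 3) joined by a touching stroke at
# (2, 1), plus a detached fragment at (5, 2) that no seed can reach.
binary = [
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
seeds = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [2, 2, 2, 2],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
result = grow_core_regions(binary, seeds)
```

In this toy example the detached fragment keeps label 0, which is the kind of pixel a separate broken-stroke attribution step would have to assign to a line.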

Keywords: image processing; Tibetan historical document images; text line segmentation; text core regions; expansion growth

Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]
