Research on segmentation of historical Chinese books (古代汉字文献切分研究)  Cited by: 8

Authors: Ni Enzhi (倪恩志)[1], Jiang Minjun (蒋旻隽)[2], Zhou Changle (周昌乐)[1]

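Affiliations: [1] Art Cognition and Computing Laboratory, School of Information Science and Technology, Xiamen University, Xiamen, Fujian 361005; [2] School of Computer Science and Information Engineering, Shanghai Institute of Technology, Shanghai 201418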

Source: Computer Engineering and Applications (《计算机工程与应用》), 2013, No. 2, pp. 29-33, 38 (6 pages)

Funding: National Natural Science Foundation of China (No.6097507)

Abstract: This paper proposes a column segmentation method and a character segmentation method suited to the characteristics of historical Chinese documents. The column segmentation method analyzes the stroke projection of the document directly and uses a recursive segmentation algorithm based on layered projection filtering and variable-length gap thresholds. It extracts columns accurately even when the spacing between columns is small, when columns touch the ruling lines, or when the document is somewhat skewed, and it performs particularly well on short columns. The character segmentation method has two steps: a coarse segmentation determines the approximate cut positions, and a fine segmentation based on connected-component analysis and the identification of touching points refines them. It can extract complete single characters even from columns that contain many touching and overlapping characters. Experimental results show that the proposed methods perform well on the segmentation of historical Chinese documents.
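The column segmentation idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, whose details are not given in this record; it is a simplified illustration that assumes a binary page image (1 = ink), treats "layered projection filtering" as raising a noise level on the projection profile, and treats the "variable-length gap threshold" as a minimum gap that shrinks at each recursion. The names `column_projection`, `find_runs`, `segment_columns`, `max_width`, `level` and `min_gap` are all hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image, rows of pixels: 1 = ink, 0 = background


def column_projection(img: Image) -> List[int]:
    """Count the ink pixels in every pixel column (vertical projection profile)."""
    return [sum(col) for col in zip(*img)]


def find_runs(profile: List[int], level: int, min_gap: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) runs where profile > level.

    Runs separated by fewer than min_gap positions are merged.
    """
    runs: List[Tuple[int, int]] = []
    start = None
    for x, value in enumerate(profile):
        if value > level:
            if start is None:
                start = x
        elif start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])  # gap too small: same column
        else:
            merged.append(run)
    return merged


def segment_columns(profile: List[int], max_width: int, level: int = 0,
                    min_gap: int = 2, offset: int = 0) -> List[Tuple[int, int]]:
    """Recursively split the profile into text columns.

    A run wider than max_width is assumed to contain several columns joined
    by faint ink (for example a ruling line), so it is split again with a
    higher filtering level and a smaller gap threshold.
    """
    columns: List[Tuple[int, int]] = []
    for start, end in find_runs(profile, level, min_gap):
        sub = profile[start:end]
        if end - start > max_width and level + 1 < max(sub):
            columns.extend(segment_columns(sub, max_width, level + 1,
                                           max(1, min_gap - 1), offset + start))
        else:
            columns.append((offset + start, offset + end))
    return columns


# Two columns joined by faint ink (values of 1), followed by a third column.
profile = [0, 0, 5, 6, 5, 1, 1, 6, 7, 5, 0, 0, 0, 4, 5, 4, 0]
print(segment_columns(profile, max_width=4))  # [(2, 5), (7, 10), (13, 16)]
```

At level 0 the first two columns form a single run of width 8; because that exceeds `max_width`, the run is re-examined with the faint layer filtered out and is split in two. The fine character segmentation step (connected components and touching points) is not covered by this sketch.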

Keywords: document image processing; document segmentation; digitization of ancient books

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
