Chinese Document Layout Analysis and Reconstruction (中文版面分析和重构)  Cited by: 1

Research on Chinese Document Layout Analysis and Reconstruction


Authors: 钟辉 (Zhong Hui)[1], 孙士兰 (Sun Shilan)[1], 刘倩 (Liu Qian)[1]

Affiliation: [1] School of Information and Control Engineering, Shenyang Jianzhu University, Shenyang, Liaoning 110168, China

Source: 《沈阳建筑大学学报(自然科学版)》 (Journal of Shenyang Jianzhu University: Natural Science), 2008, No. 2, pp. 333-336, 4 pages

Funding: Natural Science Foundation of Liaoning Province (20052006)

Abstract: Objective: To automatically extract and restore the layout information of Chinese documents when paper documents are digitized. Methods: Connected components are searched and, based on their size features, non-text regions are extracted first. The extracted non-text regions are then classified as tables, lines, or images using features such as the projection histogram, the aspect ratio, and the ratio of black to white pixels. Text regions are segmented with an improved projection-based horizontal and vertical cutting method. The XML document format is used to describe, organize, and restore the data and style of the original layout, and reconstruction generates a general-purpose electronic document that preserves the original layout format, so that the original page is reproduced. Results: Experiments on a large number of book sample pages, and on complex sample pages containing tables, images, and mixed horizontal and vertical text, show that the improved layout analysis method segments accurately and runs fast, and that the XML-based reconstruction method rebuilds the document layout fairly precisely. Conclusion: Threshold parameters derived from statistical features are used in the improved layout analysis method, which improves the adaptability of the system. The method works well on fairly regular documents and is essentially applicable to complex layouts with some manual intervention.
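The projection-based cutting and the XML description mentioned in the abstract can be illustrated with a minimal sketch. This is the classic recursive XY-cut, not the paper's improved variant, whose details the abstract does not give. The `min_gap` threshold, the function names `runs`, `xy_cut`, and `to_xml`, and the XML element and attribute names (`page`, `block`, `top`, `left`, `bottom`, `right`) are all assumptions made for illustration. The image is taken to be a binary grid in which 1 marks a black pixel.

```python
from typing import List, Tuple
import xml.etree.ElementTree as ET

Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def runs(profile: List[int], min_gap: int) -> List[Tuple[int, int]]:
    """Split a projection profile into ink runs separated by >= min_gap empty bins."""
    segments, start, gap = [], None, 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:  # gap is wide enough to count as a cut
                segments.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:  # close a run that reaches the end of the profile
        segments.append((start, len(profile) - gap))
    return segments


def xy_cut(img: List[List[int]], box: Box, min_gap: int = 2,
           horizontal: bool = True, tried_other: bool = False) -> List[Box]:
    """Recursively cut `box` along empty bands, alternating cut direction."""
    top, left, bottom, right = box
    if horizontal:  # project onto the vertical axis: one bin per row
        profile = [sum(img[r][left:right]) for r in range(top, bottom)]
    else:           # project onto the horizontal axis: one bin per column
        profile = [sum(img[r][c] for r in range(top, bottom))
                   for c in range(left, right)]
    segs = runs(profile, min_gap)
    if not segs:
        return []  # region contains no black pixels

    def sub_box(s: int, e: int) -> Box:
        if horizontal:
            return (top + s, left, top + e, right)
        return (top, left + s, bottom, left + e)

    if len(segs) == 1:
        tight = sub_box(*segs[0])
        if tried_other:  # no cut possible in either direction: leaf block
            return [tight]
        return xy_cut(img, tight, min_gap, not horizontal, True)
    blocks: List[Box] = []
    for s, e in segs:
        blocks.extend(xy_cut(img, sub_box(s, e), min_gap, not horizontal, False))
    return blocks


def to_xml(blocks: List[Box]) -> str:
    """Serialize block geometry to XML (hypothetical schema, not the paper's)."""
    root = ET.Element("page")
    for i, (t, l, b, r) in enumerate(blocks):
        ET.SubElement(root, "block", id=str(i), top=str(t), left=str(l),
                      bottom=str(b), right=str(r))
    return ET.tostring(root, encoding="unicode")


# Toy page: two blocks side by side on top, one full-width block below.
page = [
    [1, 1, 1, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
blocks = xy_cut(page, (0, 0, 6, 8), min_gap=2)
layout_xml = to_xml(blocks)
```

On the toy page, the first horizontal pass separates the top band from the bottom band, and a vertical pass then splits the top band into its two columns. A fixed `min_gap` is used here only for simplicity; the abstract states that the paper derives its threshold parameters from statistical features of the page.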

Keywords: layout analysis; layout understanding; layout reconstruction; XML

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
