Algorithm for Extracting and Restructuring Text-Block Paragraphs in PostScript Documents

Original title: PostScript文件文字块段落提取重构算法

Authors: Wu Yimin[1], Zhu Meng[1], Luo Mianchuan[1]

Affiliation: [1] School of Computer Science and Engineering, South China University of Technology, Guangzhou 510640, Guangdong, China

Source: Computer Applications and Software (《计算机应用与软件》), 2010, No. 12, pp. 273-276 (4 pages)

Abstract: Extracting text from PostScript files and restoring its paragraph formatting is a new research direction in electronic publishing. The PostScript file is parsed to extract the text of each paragraph and to obtain the two-dimensional coordinates of every character (symbol). A context-correlation analysis is then carried out on the coordinates of each character (symbol) and of the characters before and after it: the horizontal and vertical coordinates of adjacent characters are compared, and threshold values are used to determine the paragraph layout and to derive information such as paragraph breaks, line breaks, and paragraph shifts. Based on this analysis, spaces or line breaks are inserted into the text to reconstruct its paragraph formatting.
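The coordinate comparison described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not specify in detail; it is a minimal illustration under stated assumptions. The `Glyph` record, the function `rebuild_text`, and the threshold parameters `line_height`, `space_gap`, and `indent` are all hypothetical names and values chosen for the example. It also assumes the glyphs arrive in reading order from a single-column page, with the PostScript convention that y increases upward.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one extracted character (not from the paper)."""
    char: str
    x: float      # left edge of the glyph, in points
    y: float      # baseline; PostScript y grows upward
    width: float  # advance width, in points


def rebuild_text(glyphs, line_height=12.0, space_gap=2.0, indent=10.0,
                 left_margin=None):
    """Rebuild plain text by comparing each glyph with its predecessor.

    All thresholds are illustrative assumptions:
      - a baseline drop of more than half a line height means a new line;
      - a drop of more than 1.5 line heights, or a new line that starts at
        least `indent` points right of the left margin, means a new paragraph;
      - a horizontal gap wider than `space_gap` on the same line means a space.
    """
    if not glyphs:
        return ""
    if left_margin is None:
        # Assume the leftmost glyph on the page sits on the left margin.
        left_margin = min(g.x for g in glyphs)

    out = [glyphs[0].char]
    prev = glyphs[0]
    for g in glyphs[1:]:
        drop = prev.y - g.y  # positive when g is lower on the page
        if drop > 0.5 * line_height:
            indented = g.x - left_margin >= indent
            if drop > 1.5 * line_height or indented:
                out.append("\n\n")  # paragraph break
            else:
                out.append("\n")    # ordinary line break
        elif g.x - (prev.x + prev.width) > space_gap:
            out.append(" ")         # horizontal gap within a line
        out.append(g.char)
        prev = g
    return "".join(out)


if __name__ == "__main__":
    sample = [
        Glyph("H", 0, 100, 6), Glyph("i", 6, 100, 3),
        Glyph("y", 14, 100, 6), Glyph("o", 20, 100, 6),
        Glyph("o", 0, 88, 6), Glyph("k", 6, 88, 6),
        Glyph("N", 12, 76, 6),
    ]
    print(repr(rebuild_text(sample)))
```

In practice the thresholds would need to be derived from the font size and line spacing of the document being processed; fixed values such as these only work for the toy input.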

Keywords: PostScript; two-dimensional coordinates; paragraph; restructuring; layout mode

Classification number: TP317.2 [Automation and Computer Technology: Computer Software and Theory]

 
