Research on a Text Copy Optimization Method for PDF Documents  (Cited by: 1)

Authors: HE Weixiong (贺伟雄); BAI Linyuan (柏林元); GUO Wenjuan (郭文娟) (Academy of People's Armed Police, Beijing 100010; Army Engineering University of PLA, Nanjing, Jiangsu 210001)

Affiliations: [1] Academy of People's Armed Police, Beijing 100010; [2] Army Engineering University of PLA, Nanjing, Jiangsu 210001

Source: 《软件》 (Software), 2022, No. 7, pp. 63-67 (5 pages)

Abstract: Copying text from current mainstream PDF readers, especially text that mixes Chinese and English, often yields full-width characters, incorrect punctuation marks, and redundant line breaks and spaces. To address these problems, this paper proposes a text copy optimization method for PDF documents. The method monitors the clipboard to detect changes in the copied content automatically, uses regular expressions to analyze the characteristics of the copied text, and applies different optimization strategies to correct formatting errors. It also proposes three paragraph segmentation strategies to identify the paragraphs in the text correctly, so that copied text is optimized automatically without the user noticing. Experiments on four categories of PDF data sets (newspapers, and social science, science and engineering, and national defense journals) show that, compared with direct copying, the proposed method eliminates more than 95% of formatting errors, greatly reducing manual effort and improving processing efficiency.

Keywords: PDF documents; text copying; text optimization; paragraph segmentation

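Classification: TP391 [Automation and Computer Technology: Computer Application Technology]; G312 [Automation and Computer Technology: Computer Science and Technology]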
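
The abstract describes the cleanup only at a high level, so the following is a minimal Python sketch of what a regex-based pass over copied text might look like, not the authors' implementation. All function names, the character ranges, and the punctuation table are assumptions made for illustration. Paragraphs are split on blank lines here, which is a simple stand-in and not one of the paper's three segmentation strategies (the abstract does not detail them). Clipboard monitoring is left out because it depends on platform-specific APIs.

```python
import re

HAN = r"\u4e00-\u9fff"                      # basic CJK Unified Ideographs block
CJK = HAN + r"\u3000-\u303f\uff00-\uffef"   # plus CJK and full-width punctuation

# Hypothetical table: ASCII marks mapped to their full-width counterparts.
PUNCT_MAP = {",": "\uff0c", ";": "\uff1b", ":": "\uff1a", "?": "\uff1f", "!": "\uff01"}


def to_halfwidth(text):
    """Convert full-width Latin letters and digits to their ASCII forms."""
    out = []
    for ch in text:
        code = ord(ch)
        if (0xFF10 <= code <= 0xFF19 or 0xFF21 <= code <= 0xFF3A
                or 0xFF41 <= code <= 0xFF5A):
            out.append(chr(code - 0xFEE0))  # fixed offset between the two blocks
        else:
            out.append(ch)
    return "".join(out)


def join_lines(paragraph):
    """Merge hard-wrapped lines and drop spaces between two CJK characters."""
    lines = [ln.strip() for ln in paragraph.splitlines() if ln.strip()]
    text = re.sub(r"[ \t]+", " ", " ".join(lines))
    return re.sub(rf"(?<=[{CJK}]) (?=[{CJK}])", "", text)


def fix_punctuation(text):
    """Replace an ASCII mark that directly follows a Han character."""
    return re.sub(rf"(?<=[{HAN}])[,;:?!]", lambda m: PUNCT_MAP[m.group()], text)


def clean_copied_text(raw):
    """Run the three steps on each blank-line-separated paragraph."""
    text = to_halfwidth(raw.replace("\r\n", "\n"))
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [fix_punctuation(join_lines(p)) for p in paragraphs]
    return "\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "主流\uff30\uff24\uff26阅\n读器复制 文字时,存在\n多余换行符。\n\nThe method uses\nregular expressions."
    print(clean_copied_text(sample))
```

A line break between a CJK character and a Latin word is turned into a single space in this sketch, while a break between two CJK characters is removed entirely. A real tool would need further rules, for example for words hyphenated across lines.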

 
