检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:贺伟雄 柏林元 郭文娟 HE Weixiong;BAI Linyuan;GUO Wenjuan(Academy of People's Armed Police,Beijing 100010;Army Engineering University of PLA,Nanjing Jiangsu 210001)
机构地区:[1]武警部队研究院,北京100010 [2]陆军工程大学,江苏南京210001
出 处:《软件》2022年第7期63-67,共5页Software
摘 要:针对当前主流PDF阅读器复制文字尤其是中英文混合排版文字时存在的全角字符、错误标点符号、多余换行符和空格等问题,提出了一种面向PDF文档的文本复制优化方法,通过剪贴板监听自动感知复制内容变化,基于正则表达式分析复制文本内容特点并采用不同优化策略修正文本格式错误,并提出了3种不同的段落切分策略正确识别文本中的段落,实现了用户“无感知”情况下的复制文本自动优化。在报纸、社科、理工和国防类期刊等4类PDF数据集的实验表明,与直接复制相比,提出的方法能够消除95%以上的格式错误,极大地减轻了人工负担,提高了处理效率。To solve the problems of full-corner characters,wrong punctuation marks,redundant line breaks,and spaces in the copying of text,especially the mixed typesetting text in Chinese and English,in the current mainstream PDF readers,a text copying optimization method for PDF documents was proposed.Based on the regular expression analysis of the characteristics of the copied text content,different optimization strategies were adopted to correct the formatting errors of the text.Three different paragraph segmentation strategies were proposed to correctly identify paragraphs in the text,which realized the automatic optimization of the copied text in the case of"No Perception"by users.Experiments on four kinds of PDF data sets,such as newspaper,social science,science and technology,and national defense journals,show that compared with direct copying,the proposed method can eliminate more than 95%of format errors,significantly reduce the manual burden and improve the processing efficiency.
分 类 号:TP391[自动化与计算机技术—计算机应用技术] G312[自动化与计算机技术—计算机科学与技术]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:18.119.0.207