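Affiliations: [1] Center for Transcultural Studies, Heidelberg University, Germany; [2] Department of Sinology, Heidelberg University, Germany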
Source: Digital Humanities Research (《数字人文研究》), 2023, No. 4, pp. 49-62 (14 pages)
Abstract: Many researchers, particularly in Europe and North America, have explored the use of machine learning for optical character recognition (OCR), and many projects are producing ground truth (GT) data for this purpose. The situation is different for non-Latin script (NLS) material. In 2021, the Early Chinese Periodicals Online (ECPO) project at the University of Heidelberg began working on ways to produce machine-readable full text from historical Chinese newspapers. ECPO uses several machine-learning approaches, including convolutional neural networks, to develop a semi-automatic pipeline for producing machine-readable full text. The Republican-era entertainment newspaper Jing Bao (The Crystal, 1919-1940) was chosen as the basis for the experiments. The paper focuses on two main aspects. First, it describes the ground truth editing work in detail, including assembling the editing team, organizing the workflows, establishing processing guidelines, and ensuring quality control. Second, it discusses particular challenges in producing the GT sets, including character-encoding issues and problems with variant characters related to Unicode. The project produced two sets of ground truth data: textual/structural data (full-text GT) and page-segmentation data (geometry GT). Based on this work, the paper also points out pitfalls encountered in the project and offers hints for avoiding them, in order to make machine learning more efficient and to help others working with NLS material.
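Classification: H127 [Language and Linguistics: Chinese]; TP391.1 [Automation and Computer Technology: Computer Application Technology]; K26 [Automation and Computer Technology: Computer Science and Technology]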