Digital Recognition of Minority Language Documents in China: Based on OCR and Its Tools  (Cited by: 1)


Authors: FAN Junjun [1]; LIU Xianxian

Affiliation: [1] College of Liberal Arts, Jinan University

Source: Jinan Journal (Philosophy and Social Sciences), 2024, No. 6, pp. 31-45 (15 pages)

Abstract: China's minority language documents are vast in number and written in many scripts. Their content spans politics, economics, law, history, literature, art, religion, astronomy, geography, and medicine, and they form an important part of the cultural knowledge of the Chinese nation. Building text data from the documents of all ethnic groups and applying it to natural language processing and artificial intelligence can effectively promote the innovative transmission of excellent traditional Chinese knowledge and the socialization of knowledge. Character recognition and text conversion of ancient documents and of modern books, newspapers, and periodicals in these languages is the foundation of this data construction. Early domestic OCR technology solved the recognition of several major minority scripts, but it was abandoned because its characters were not encoded in the Unicode basic set. Current OCR technology recognizes documents in Mongolian, Tibetan, Uyghur, Kazakh, Korean, and other scripts reasonably well, but it still performs poorly on image text that mixes Chinese with minority scripts. OCR technology for minority language documents therefore needs further innovation. More than ten living scripts are currently used in China's minority language documents, 11 of which are non-Latin. OCR work should focus on recognizing manuscripts, woodblock prints, and letterpress texts in these scripts, as well as texts that mix Chinese with minority scripts, and on developing open, multifunctional tools and platforms. On this basis, large-scale text data can be built from minority language documents to advance language science research and natural language processing in China.

China has over 130 minority languages and more than 10 minority scripts. These have preserved a wealth of ethnic language documents, including a large number of ancient manuscripts and modern printed documents. These records capture the long-standing civilization of the Chinese nation and the knowledge and practices of various ethnic groups in their production and daily life. The content covers a wide range of areas, including politics, economics, law, history, literature, art, religion, medicine, astronomy, and geography, reflecting the exchange, integration, and innovation of various ethnic cultures. Fully utilizing contemporary data science and artificial intelligence (AI) to innovate text recognition technology for minority language documents and to digitize massive amounts of literature is of great historical, cultural, and practical political significance. This effort is crucial for the scientific protection of Chinese minority language document resources and the inheritance of excellent Chinese traditional knowledge and cultural spirit. Collecting and organizing minority language books, newspapers, and manuscripts for large-scale text recognition and digitization is a crucial source for building natural language processing (NLP) and AI datasets. The digitization of minority language documents involves two fundamental tasks: (1) compiling and cataloging various documents to create indexed data; and (2) performing optical character recognition (OCR) on the content of these documents to convert them into computer-processable text files. The recognition of minority language text is the prerequisite for the digitization of document content, while OCR is key to constructing large-scale corpora and knowledge text data in China's native languages. Efficient document OCR technology has broad applications. It enables publishers to transition from passively receiving manuscripts to actively creating knowledge content, maximizing content production potential. Additionally, it facilitates the extra
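
The abstract notes that current OCR still struggles with text that mixes Chinese with minority scripts. As a minimal sketch only (not the authors' method or any tool the paper describes), one simple check on OCR output is to split a recognized string into runs by Unicode block, so that each run can be inspected or routed to a script-specific step. The names `SCRIPT_RANGES`, `script_of`, and `script_runs` are hypothetical, and the block ranges listed are an illustrative, non-exhaustive assumption.

```python
# Assumed, non-exhaustive Unicode block ranges for a few scripts named in the abstract.
SCRIPT_RANGES = {
    "Han": (0x4E00, 0x9FFF),        # CJK Unified Ideographs
    "Tibetan": (0x0F00, 0x0FFF),
    "Mongolian": (0x1800, 0x18AF),  # traditional Mongolian script
    "Arabic": (0x0600, 0x06FF),     # base block used by Uyghur and Kazakh orthographies
    "Hangul": (0xAC00, 0xD7AF),     # Hangul syllables (Korean)
}


def script_of(ch):
    """Return the assumed script label for one character, or "Other"."""
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            return name
    return "Other"


def script_runs(text):
    """Split text into (script, substring) runs.

    Characters labeled "Other" (spaces, digits, punctuation, Latin letters)
    are attached to the preceding run rather than starting a new one.
    """
    runs = []  # each item is [script, accumulated characters]
    for ch in text:
        script = script_of(ch)
        if runs and (script == runs[-1][0] or script == "Other"):
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    # Drop surrounding whitespace and any run that is only whitespace.
    return [(s, t.strip()) for s, t in runs if t.strip()]


if __name__ == "__main__":
    # Han, Tibetan, and Hangul characters written as escapes for portability.
    sample = "\u6c11\u65cf \u0f56\u0f7c\u0f51 \ud55c\uad6d"
    for script, chunk in script_runs(sample):
        print(script, chunk)
```

A production system would more likely rely on the Unicode Script property than on fixed block ranges, since several scripts span more than one block.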

Keywords: minority languages; ethnic documents; text recognition; OCR; data construction; digital humanities

Classification number: H2 [Language and Script: Minority Languages]

 
