Authors: FAN Junjun (范俊军)[1]; LIU Xianxian (刘贤娴)
Affiliation: [1] College of Liberal Arts, Jinan University (暨南大学文学院)
Source: Jinan Journal (Philosophy and Social Sciences) (《暨南学报(哲学社会科学版)》), 2024, No. 6, pp. 31-45 (15 pages)
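Abstract (translated from the Chinese): China's minority-language documents are vast in number and written in many scripts. Their content spans politics, economics, law, history, literature, art, religion, astronomy, geography, medicine, and other fields, and they form an important part of the cultural knowledge of the Chinese nation. Building text data from the documents of each ethnic group and applying it to natural language processing and artificial intelligence can effectively promote the innovative transmission of fine traditional Chinese knowledge and the socialization of knowledge. Character recognition and text conversion of ancient documents and of modern books, newspapers, and periodicals in these languages is the foundation for building such data. Early OCR technology in China solved the recognition problem for several major minority scripts, but it was abandoned because its characters were encoded outside the basic Unicode set. Current OCR technology recognizes documents in Mongolian, Tibetan, Uyghur, Kazakh, Korean, and other scripts fairly well, but it still performs poorly on image text in which Chinese is mixed with minority scripts. Innovation in OCR for minority-language documents should therefore be advanced. More than ten living scripts are currently in use for China's minority-language documents, 11 of them non-Latin. OCR work should focus on recognizing manuscripts, woodblock prints, and letterpress-printed texts in these scripts, as well as texts that mix Chinese with minority scripts, and on developing open, multifunctional tools and platforms. On this basis, large-scale text data construction for minority-language documents can proceed, promoting innovation in China's linguistic research and natural language processing.

Abstract (English, as published): China has over 130 minority languages and more than 10 minority scripts. These have preserved a wealth of ethnic language documents, including a large number of ancient manuscripts and modern printed documents. These records capture the long-standing civilization of the Chinese nation and the knowledge and practices of various ethnic groups in their production and daily life. The content covers a wide range of areas, including politics, economics, law, history, literature, art, religion, medicine, astronomy, and geography, reflecting the exchange, integration, and innovation of various ethnic cultures. Fully utilizing contemporary data science and artificial intelligence (AI) to innovate text recognition technology for minority language documents and achieving the digitization of massive amounts of literature is of great historical and cultural significance, and practical political significance. This effort is crucial for the scientific protection of Chinese minority language document resources and the inheritance of excellent Chinese traditional knowledge and cultural spirit. Collecting and organizing minority language books, newspapers, and manuscripts for large-scale text recognition and digitization is a crucial source for building natural language processing (NLP) and AI datasets. The digitization of minority language documents involves two fundamental tasks: (1) compiling and cataloging various documents to create indexed data; and (2) performing optical character recognition (OCR) on the content of these documents to convert them into computer-processable text files. The recognition of minority language text is the prerequisite for the digitization of document content, while OCR text recognition is key to constructing large-scale corpora and knowledge text data in China's native languages. Efficient document OCR recognition technology has broad applications. It enables publishers to transition from passively receiving manuscripts to actively creating knowledge content, maximizing content production potential. Additionally, it facilitates the extra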
Keywords: minority languages; ethnic documents; text recognition; OCR; data construction; digital humanities
Classification number: H2 [Language and Writing: Minority Languages]