A Joint Model of Automatic Sentence Segmentation and Punctuation for Ancient Classical Texts Based on Deep Learning (基于深度学习的古籍文本自动断句与标点一体化研究)

Cited by: 6

Authors: Yuan Yiguo (袁义国); Li Bin (李斌)[1,2]; Feng Minxuan (冯敏萱)[1]; He Sheng (贺胜)[1]; Wang Dongbo (王东波)[3]

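Affiliations: [1] School of Chinese Language and Literature, Nanjing Normal University, Nanjing 210097; [2] Institute of Digit and Humanities, Nanjing Normal University, Nanjing 210023; [3] College of Information Management, Nanjing Agricultural University, Nanjing 210095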

Source: Library and Information Service (《图书情报工作》), 2022, No. 22, pp. 134-141 (8 pages)

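Funding: This work is one of the research outcomes of the Jiangsu Provincial Social Science Fund project "Research on Artificial Intelligence-Assisted Traditional Culture Education for Young People" (No. 20JYB004) and the National Social Science Fund of China major project "Construction and Application of a Cross-Language Knowledge Base of Ancient Chinese Classics" (No. 21ZD&331).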

Abstract: [Purpose/Significance] China has a vast body of ancient classical texts, and automatic sentence segmentation and punctuation by computer can speed up the conversion and use of these resources. Existing research has two pressing problems. First, it treats sentence segmentation and punctuation as two serial tasks, which causes error propagation. Second, the automatically tagged punctuation is inconsistent, and long-distance, nestable paired quotation marks have received little attention.

[Method/Process] Based on punctuation frequency statistics from a large-scale corpus of ancient texts and on existing punctuation usage standards, the paper defines the set of marks used for automatic punctuation of ancient texts. Because stop punctuation marks (点号) already carry sentence-boundary information, the paper proposes an integrated approach that punctuates unsegmented ancient text directly. A multi-way quotation-mark tag set, together with placeholder padding at the start of each paragraph, addresses the tagging of long-distance, nestable paired quotation marks. The task is cast as sequence labeling, and a SikuRoBERTa-BiLSTM-CRF model is trained on a traditional-character corpus of ancient texts containing more than 100 million characters.

[Result/Conclusion] On the open test set Zuo Zhuan (《左传》), the F1 score is 77.09% for stop punctuation tagging and 91.72% for sentence segmentation; it is 89.28% for individual quotation marks and 83.88% for paired quotation marks. The results show that the method improves automatic sentence segmentation and punctuation of ancient texts and handles the automatic tagging of quotation marks effectively.
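The integrated idea in the abstract, that stop punctuation predicted per character also yields sentence boundaries, can be illustrated with a minimal sketch. It is not the paper's implementation: the label scheme (each character tagged with the mark that follows it, or "O"), the `STOP_MARKS` set, and all function names are assumptions made for illustration. Quotation marks are left out because the abstract does not detail the quotation-mark tag set.

```python
# Assumed set of stop punctuation marks; the paper's actual inventory
# is not given in the abstract.
STOP_MARKS = set("，。？！；：、")


def text_to_labels(punctuated):
    """Turn punctuated text into (characters, labels) for sequence labeling.

    Each character gets the stop mark that directly follows it, or "O".
    If several marks follow one character, the last one wins.
    """
    chars, labels = [], []
    for ch in punctuated:
        if ch in STOP_MARKS:
            if labels:  # attach the mark to the preceding character
                labels[-1] = ch
        else:
            chars.append(ch)
            labels.append("O")
    return chars, labels


def labels_to_segments(chars, labels):
    """Read sentence segmentation off the labels: any stop mark ends a segment."""
    segments, current = [], []
    for ch, lab in zip(chars, labels):
        current.append(ch)
        if lab != "O":
            segments.append("".join(current))
            current = []
    if current:  # trailing text with no closing mark
        segments.append("".join(current))
    return segments


def restore(chars, labels):
    """Rebuild punctuated text from characters and their labels."""
    return "".join(ch + ("" if lab == "O" else lab) for ch, lab in zip(chars, labels))


chars, labels = text_to_labels("學而時習之，不亦說乎？")
print(labels_to_segments(chars, labels))
print(restore(chars, labels))
```

In this framing a tagger only has to predict one label sequence; segmentation is a by-product of the punctuation labels rather than a separate upstream step whose errors would carry over.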

Keywords: automatic sentence segmentation; automatic punctuation; ancient books; deep learning; digital humanities

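Classification: TP391.11 [Automation and Computer Technology: Computer Application Technology]; G250 [Automation and Computer Technology: Computer Science and Technology]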
