Automatic Traditional Ancient Chinese Texts Segmentation and Punctuation Based on Pre-trained Language Model
(Original title: 基于预训练语言模型的繁体古文自动句读研究)

Cited by: 2


Authors: TANG Xuemei (唐雪梅); SU Qi (苏祺) [2,3,4]; WANG Jun (王军) [1,2,4]; CHEN Yuhang (陈雨航) [1,2]; YANG Hao (杨浩)

Affiliations: [1] Department of Information Management, Peking University, Beijing 100871, China; [2] Digital Humanities Center of Peking University, Beijing 100871, China; [3] School of Foreign Languages, Peking University, Beijing 100871, China; [4] Institute for Artificial Intelligence, Peking University, Beijing 100871, China

Source: Journal of Chinese Information Processing (《中文信息学报》), 2023, No. 8, pp. 159-168 (10 pages)

Funding: National Natural Science Foundation of China (72010107003).

Abstract: Ancient Chinese books that have not been collated contain no punctuation, which does not match modern reading habits; adding sentence breaks and punctuation makes them easier to read, study, and publish. In this paper, we propose a framework for automatic sentence segmentation (judou) and punctuation of ancient Chinese text in traditional characters, based on a pre-trained language model. We compile a corpus of about 1 billion characters of traditional-character ancient Chinese, use it to incrementally train BERT (Bidirectional Encoder Representations from Transformers), and build automatic segmentation and punctuation on top of the resulting model. Experimental results show that the incrementally trained model represents the semantics of ancient Chinese better and improves both tasks: the segmentation F1 score reaches 95.03% and the punctuation F1 score reaches 80.18%, which are 1.83% and 2.21% higher, respectively, than with the language model without incremental training. To address the low efficiency of existing document-level segmentation schemes, we modify the earlier serial sliding-window approach, which improves efficiency to some extent, and we propose a new parallel sliding-window approach that segments long texts efficiently and accurately.
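The abstract names a parallel sliding-window scheme for long texts but gives no details of it. The sketch below shows one plausible form of the idea and is not the authors' implementation: the text is cut into overlapping windows, all windows are labelled in a single batched call, and each character takes its label from the window in which it lies farthest from an edge. The names `make_windows`, `segment_long_text`, and `toy_predict_batch`, the window size and stride, and the merge rule are all assumptions made for illustration; the toy predictor stands in for a fine-tuned BERT tagger.

```python
from typing import Callable, List, Sequence

def make_windows(n: int, size: int, stride: int) -> List[range]:
    """Overlapping index ranges that together cover positions 0..n-1."""
    if n <= size:
        return [range(0, n)]
    starts = list(range(0, n - size + 1, stride))
    if starts[-1] + size < n:
        # Add a final window aligned to the end so no tail is left uncovered.
        starts.append(n - size)
    return [range(s, s + size) for s in starts]

def segment_long_text(
    text: str,
    predict_batch: Callable[[Sequence[str]], List[List[int]]],
    size: int = 64,
    stride: int = 32,
) -> List[int]:
    """Label each character (1 = break after it, 0 = no break).

    Assumed merge rule: where windows overlap, keep the label from the
    window in which the character has the most context on both sides.
    """
    windows = make_windows(len(text), size, stride)
    chunks = [text[w.start:w.stop] for w in windows]
    # All windows are scored in one call, so a real model could batch them.
    batch_labels = predict_batch(chunks)

    labels = [0] * len(text)
    best_margin = [-1] * len(text)
    for window, window_labels in zip(windows, batch_labels):
        for offset, index in enumerate(window):
            margin = min(offset, len(window) - 1 - offset)
            if margin > best_margin[index]:
                best_margin[index] = margin
                labels[index] = window_labels[offset]
    return labels

def toy_predict_batch(chunks: Sequence[str]) -> List[List[int]]:
    """Hypothetical stand-in for a model: break after sentence-final particles."""
    return [[1 if ch in "乎也矣" else 0 for ch in chunk] for chunk in chunks]

def render(text: str, labels: Sequence[int], mark: str = "/") -> str:
    """Insert a break mark after every character labelled 1."""
    return "".join(ch + (mark if lab else "") for ch, lab in zip(text, labels))

if __name__ == "__main__":
    sample = "學而時習之不亦說乎有朋自遠方來不亦樂乎人不知而不慍不亦君子乎"
    print(render(sample, segment_long_text(sample, toy_predict_batch, size=8, stride=4)))
```

Because every window is independent of the others, the batched call is what would make such a scheme parallel; a serial scheme would instead process one window at a time and choose the next window's start from the previous prediction.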

Keywords: automatic sentence segmentation (judou); automatic punctuation; pre-trained language model

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
