Automatic Sentence Segmentation and Automatic Punctuation for Ancient Chinese Texts Based on Generative Large Language Model (基于生成式大语言模型的古文自动断句与标点研究)

Authors: Xia Tian [1]; Yu Kai; Yu Qianrong; Peng Xinran; Zhao Qun; Yang Menghui [1]

Affiliations: [1] School of Information Resource Management, Renmin University of China, Beijing 100872; [2] Midu Technology Co., Ltd., Shanghai 201204

Source: Library and Information Service (《图书情报工作》), 2025, No. 5, pp. 59-70 (12 pages)

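Funding: One of the research outputs of the National Social Science Fund of China project "Research on Knowledge Discovery and Intelligent Services Based on Full Book Content" (Grant No. 22BTQ068).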

Abstract: [Purpose/Significance] This study applies generative large language models to automatic sentence segmentation and punctuation of ancient Chinese texts. It addresses a limitation of traditional sequence labeling models, which require specially designed labels and purpose-built annotated data, and aims to improve segmentation and punctuation quality. [Method/Process] A sliding window strategy splits the training data into chunks to increase the number of learnable samples. MinHash and locality-sensitive hashing (LSH) are used to provide reference examples for unpunctuated texts, and the decoding process of the large language model is constrained. The Xunzi large language model for ancient books serves as the base model and is fine-tuned with Low-Rank Adaptation (LoRA) so that it understands and aligns with the ancient Chinese punctuation task, generating target text that contains punctuation marks from unpunctuated input. [Result/Conclusion] On the two comparable test sets released by EvaHan 2024, the F1 scores for automatic sentence segmentation are 88.47% and 92.48%, and the F1 scores for automatic punctuation are 75.29% and 80.01%, respectively. These results significantly outperform the Xunzi large language model and ChatGPT 3.5, indicating that generative large language models are a feasible approach to sentence segmentation and punctuation of ancient Chinese texts.

Keywords: automatic sentence segmentation; automatic punctuation; ancient books; digital humanities; large language models

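Classification: TP391.1 [Automation and Computer Technology: Computer Application Technology]; G250 [Automation and Computer Technology: Computer Science and Technology]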
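
The abstract names two data-preparation steps, sliding-window chunking and MinHash/LSH retrieval of reference examples, without giving details. The following is a minimal standard-library sketch of how such steps might look, not the authors' implementation. The punctuation set, the window size and stride, the character bigram shingles, the number of hash functions, the band count, and all function names are illustrative assumptions. LoRA fine-tuning and constrained decoding are not covered.

```python
import hashlib
from collections import defaultdict

# Assumed punctuation inventory; the paper's actual set is not given in the abstract.
PUNCT = set("，。、；：？！“”‘’《》（）")


def strip_punct(text):
    """Remove punctuation to obtain the unpunctuated model input."""
    return "".join(ch for ch in text if ch not in PUNCT)


def sliding_windows(text, size, stride):
    """Split text into overlapping chunks of at most `size` characters."""
    if len(text) <= size:
        return [text]
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # last window reached the end
            break
    return chunks


def shingles(text, n=2):
    """Set of character n-grams (the whole text if it is shorter than n)."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text, num_perm=64, n=2):
    """MinHash signature: the minimum hash of the shingle set per seed."""
    grams = shingles(text, n)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(num_perm))


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def build_lsh_index(signatures, bands=32):
    """Bucket signature ids by band; ids sharing any bucket become candidates."""
    rows = len(signatures[0]) // bands
    index = defaultdict(set)
    for idx, sig in enumerate(signatures):
        for b in range(bands):
            index[(b, sig[b * rows:(b + 1) * rows])].add(idx)
    return index


def nearest_example(index, sig, signatures, bands=32):
    """Return the id of the most similar LSH candidate, or None if there is none."""
    rows = len(sig) // bands
    candidates = set()
    for b in range(bands):
        candidates |= index.get((b, sig[b * rows:(b + 1) * rows]), set())
    return max(candidates,
               key=lambda i: estimate_jaccard(sig, signatures[i]),
               default=None)


if __name__ == "__main__":
    # Illustrative punctuated passages (not taken from the paper's data).
    corpus = [
        "學而時習之，不亦說乎？有朋自遠方來，不亦樂乎？",
        "道可道，非常道。名可名，非常名。",
    ]
    # Each punctuated chunk is a target; its stripped form is the model input.
    for target in sliding_windows(corpus[0], size=12, stride=6):
        print(strip_punct(target), "->", target)

    plain = [strip_punct(t) for t in corpus]
    signatures = [minhash(p) for p in plain]
    index = build_lsh_index(signatures)
    best = nearest_example(index, minhash("有朋自遠方來不亦樂乎"), signatures)
    print("reference example:", None if best is None else corpus[best])
```

In this sketch a retrieved passage would be placed in the prompt as a punctuated reference for the unpunctuated query. Because LSH is probabilistic, a query with only moderate overlap may return no candidate, so the caller has to handle `None`.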
