Automatic Sentence Segmentation and Automatic Punctuation for Ancient Chinese Texts Based on Generative Large Language Model

Authors: Xia Tian [1]; Yu Kai; Yu Qianrong; Peng Xinran; Zhao Qun; Yang Menghui [1]

Affiliations: [1] School of Information Resource Management, Renmin University of China, Beijing 100872; [2] Midu Technology Co., Ltd., Shanghai 201204

Source: Library and Information Service (《图书情报工作》), 2025, No. 5, pp. 59-70 (12 pages)

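Funding: One of the research outputs of the National Social Science Fund of China project "Research on Knowledge Discovery and Intelligent Services Based on Full Book Content" (Grant No. 22BTQ068).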

Abstract: [Purpose/Significance] This study applies generative large language models to automatic sentence segmentation and punctuation of ancient Chinese texts. The aim is to overcome a limitation of traditional sequence labeling models, which require specially designed labels and purpose-built annotated data, and to improve segmentation and punctuation quality. [Method/Process] A sliding window strategy splits the training data into chunks to increase the number of learnable samples. MinHash and Locality-Sensitive Hashing (LSH) are used to retrieve reference examples for unpunctuated texts, and the decoding process of the large language model is constrained. The Xunzi ancient-text large language model serves as the base model and is fine-tuned with Low-Rank Adaptation (LoRA) so that it understands and aligns with the ancient Chinese punctuation task, generating punctuated target text from unpunctuated input. [Result/Conclusion] On two comparable test sets released by EvaHan 2024, the F1 scores for automatic sentence segmentation are 88.47% and 92.48%, and the F1 scores for automatic punctuation are 75.29% and 80.01%, respectively. These results significantly outperform the Xunzi large language model and ChatGPT 3.5, indicating that generative large language models are a feasible approach to sentence segmentation and punctuation of ancient Chinese texts.
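The abstract names several steps without giving their details, so the following is only a minimal Python sketch of how three of them might look: sliding-window chunking into (unpunctuated input, punctuated target) pairs, one plausible decoding constraint (the output may only copy the next source character or insert a punctuation mark), and a segmentation F1 over break positions. The punctuation set, the window size and stride, and every function name here are hypothetical; the paper's actual constraint and the exact EvaHan 2024 scoring rules may differ.

```python
# Hypothetical punctuation inventory; the paper's actual mark set is not given.
PUNCT = set(",。、;:?!“”‘’《》()")


def strip_punct(text):
    """Remove punctuation to obtain the unpunctuated model input."""
    return "".join(ch for ch in text if ch not in PUNCT)


def sliding_windows(text, size, stride):
    """Split text into overlapping chunks of length `size`, stepping by `stride`."""
    if len(text) <= size:
        return [text]
    chunks = [text[i:i + size] for i in range(0, len(text) - size + 1, stride)]
    # Add a final window when the regular steps do not reach the end of the text.
    if (len(text) - size) % stride != 0:
        chunks.append(text[-size:])
    return chunks


def make_pairs(text, size, stride):
    """Build (unpunctuated input, punctuated target) training pairs."""
    return [(strip_punct(c), c) for c in sliding_windows(text, size, stride)]


def allowed_next(source, consumed):
    """Characters a constrained decoder may emit once `consumed` source
    characters have been copied: any punctuation mark, or the next source
    character. This is an assumed constraint, not the paper's stated one."""
    allowed = set(PUNCT)
    if consumed < len(source):
        allowed.add(source[consumed])
    return allowed


def break_positions(punctuated):
    """Positions (number of non-punctuation characters seen so far) at which
    a punctuation mark occurs."""
    positions, seen = set(), 0
    for ch in punctuated:
        if ch in PUNCT:
            positions.add(seen)
        else:
            seen += 1
    return positions


def segmentation_f1(pred, gold):
    """F1 over break positions, ignoring which mark was used (assumed metric)."""
    p, g = break_positions(pred), break_positions(gold)
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sample = "學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?"
    for source, target in make_pairs(sample, size=12, stride=6):
        print(source, "->", target)
```

Under this constraint the generated text, with punctuation removed, is always a prefix of the input, so the model cannot alter or invent source characters; only the placement and choice of marks is learned.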

Keywords: automatic sentence segmentation; automatic punctuation; ancient books; digital humanities; large language model

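Classification No.: TP391.1 [Automation and Computer Technology: Computer Application Technology]; G250 [Automation and Computer Technology: Computer Science and Technology]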

 
