An Improved Method Based on n-gram Model for Ancient Chinese Sentence Segmentation and Punctuation


Author: QIN Ruilin (秦瑞琳), College of Computer Engineering, Jimei University, Xiamen 361021, China

Affiliation: [1] College of Computer Engineering, Jimei University, Xiamen, Fujian 361021, China

Source: Journal of Jimei University (Natural Science), 2025, No. 2, pp. 198-204 (7 pages)

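Funding: Fujian Provincial Education and Scientific Research Project for Young and Middle-aged Teachers, "Quantum computing model of emotional feeling and its simulation implementation" (JAT210243); Xiamen Natural Science Foundation project, "Robot affective computing model incorporating quantum mechanisms and its simulation implementation" (3502Z202473063).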

Abstract: Automatic sentence segmentation and punctuation of ancient Chinese texts is important for raising the level of automation in the collation of Chinese ancient books. Most existing algorithms for these tasks do not account for the interaction between neighboring punctuation marks. To address this, the paper proposes an improved n-gram-based method for ancient Chinese sentence segmentation and punctuation. The method combines contextual information from 2-grams through 5-grams, computes the punctuation probability at the current position as a weighted combination, and uses that probability to assist in computing the punctuation probabilities at the preceding and following positions, thereby capturing the mutual influence between adjacent marks. Experiments on several ancient-book corpora show that the proposed method achieves higher F1 scores than existing n-gram and GRU-RNN models on sentence segmentation, and that on some corpora it outperforms the BiLSTM+CRF model on both sentence segmentation and punctuation.
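The weighted combination of 2-gram through 5-gram evidence described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names (`split_labels`, `train`, `mark_probs`), the choice of context (the n characters immediately before a gap), the use of relative frequencies, and the weight values. The second stage of the method, in which the current position's probability adjusts the probabilities at neighboring positions, is not reproduced.

```python
from collections import Counter, defaultdict

# Assumed mark inventory; the paper's actual label set is not given in the abstract.
PUNCT = set(",。?!;:、")

def split_labels(text):
    """Return (chars, labels): labels[i] is the mark following chars[i], or ''."""
    chars, labels = [], []
    for ch in text:
        if ch in PUNCT:
            if labels:                 # attach the mark to the preceding character
                labels[-1] = ch
        else:
            chars.append(ch)
            labels.append("")
    return chars, labels

def train(corpus, orders=(2, 3, 4, 5)):
    """For each order n, count which mark follows each n-character context."""
    counts = {n: defaultdict(Counter) for n in orders}
    for text in corpus:
        chars, labels = split_labels(text)
        for i, label in enumerate(labels):
            for n in orders:
                if i + 1 >= n:
                    ctx = "".join(chars[i + 1 - n : i + 1])
                    counts[n][ctx][label] += 1
    return counts

def mark_probs(counts, chars, i, weights):
    """Weighted mix of per-order relative frequencies for the gap after chars[i]."""
    scores, total_w = Counter(), 0.0
    for n, w in weights.items():
        if i + 1 < n:
            continue                   # not enough left context for this order
        seen = counts[n].get("".join(chars[i + 1 - n : i + 1]))
        if not seen:
            continue                   # context never observed in training
        total = sum(seen.values())
        for label, c in seen.items():
            scores[label] += w * c / total
        total_w += w
    # Renormalize over the orders that actually contributed; default to "no mark".
    return {k: v / total_w for k, v in scores.items()} if total_w else {"": 1.0}

if __name__ == "__main__":
    counts = train(["学而时习之,不亦说乎。有朋自远方来,不亦乐乎。"])
    weights = {2: 0.1, 3: 0.2, 4: 0.3, 5: 0.4}   # hypothetical weights
    print(mark_probs(counts, list("学而时习之不亦说乎"), 4, weights))
```

Longer contexts are given larger weights in this sketch on the assumption that they are more specific; orders whose context was never observed are skipped and the remaining weights are renormalized.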

Keywords: ancient Chinese; sentence segmentation; punctuation; n-gram model; deep learning

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
