Prompt learning method for ancient text sentence segmentation and punctuation based on span-extracted prototypical network


Authors: GAO Yingjie; LIN Min; Siriguleng; LI Bin; ZHANG Shujun

Affiliations: [1] College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot, Inner Mongolia 010022, China; [2] School of Chinese Language and Literature, Inner Mongolia Normal University, Hohhot, Inner Mongolia 010022, China; [3] College of Computer Science and Technology, Inner Mongolia Minzu University, Tongliao, Inner Mongolia 028000, China

Source: Journal of Computer Applications, 2024, No. 12, pp. 3815-3822 (8 pages)

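Funding: National Natural Science Foundation of China (62266033); Natural Science Foundation of Inner Mongolia (2021LHMS06010); Science and Technology Program of Inner Mongolia Autonomous Region (2021GG0218); Open Project of an Inner Mongolia Autonomous Region-level Key Laboratory of the Ministry of Education (2023KFZD03); Inner Mongolia Autonomous Region Master's Graduate Research Innovation Project (S20231076Z); Fundamental Research Funds of Inner Mongolia Normal University (2022JBXC018).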

Abstract: Automatic sentence segmentation and punctuation in ancient book information processing relies on large-scale annotated corpora, yet high-quality, large-scale samples are expensive to train on and difficult to obtain. To address this, a prompt learning method for ancient text sentence segmentation and punctuation based on a span-extracted prototypical network is proposed. First, structured prompt information is incorporated into the support set to form an effective prompt template, improving the model's learning efficiency. Second, a punctuation position extractor is combined with a prototypical network classifier, which effectively reduces the impact of misjudgments and the interference from non-punctuation labels found in traditional sequence labeling methods. Experimental results show that, on the Records of the Grand Historian (Shiji) dataset, the F1 score of the proposed method is 2.47 percentage points higher than that of the Siku-BERT-BiGRU-CRF (Siku-Bidirectional Encoder Representation from Transformer-Bidirectional Gated Recurrent Unit-Conditional Random Field) method. In addition, on the public multi-domain ancient text dataset CCLUE, the precision and F1 score of the proposed method reach 91.60% and 93.12%, respectively, indicating that the method can segment and punctuate multi-domain ancient texts effectively using only a small number of training samples. The proposed method therefore offers a new approach for further research on automatic sentence segmentation and punctuation of multi-domain ancient texts and for improving model learning efficiency.
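The two-stage idea in the abstract (first extract candidate punctuation positions, then classify only those positions with a prototypical network) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 0.5 threshold, the toy 2-D embeddings, and the hand-written position scores are all assumptions made for illustration. In practice the embeddings and scores would come from a pretrained encoder, which is omitted here.

```python
from collections import defaultdict


def build_prototypes(support):
    """Average the support embeddings of each label into one prototype vector."""
    grouped = defaultdict(list)
    for embedding, label in support:
        grouped[label].append(embedding)
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(embedding, prototypes):
    """Return the label whose prototype is nearest to the embedding."""
    return min(prototypes, key=lambda label: squared_distance(embedding, prototypes[label]))


def punctuate(position_scores, embeddings, prototypes, threshold=0.5):
    """Stage 1: keep positions scoring above the (assumed) threshold.
    Stage 2: assign a punctuation label only to those positions, so
    non-punctuation positions never reach the classifier."""
    return {
        i: classify(embeddings[i], prototypes)
        for i, score in enumerate(position_scores)
        if score > threshold
    }


# Toy support set: (embedding, punctuation label). Values are made up.
support = [
    ([0.9, 0.1], "COMMA"), ([1.1, -0.1], "COMMA"),
    ([0.0, 1.0], "PERIOD"), ([0.2, 0.8], "PERIOD"),
]
prototypes = build_prototypes(support)

# Hypothetical extractor scores and embeddings for four character positions.
position_scores = [0.1, 0.8, 0.2, 0.9]
embeddings = [[0.5, 0.5], [1.0, 0.2], [0.4, 0.6], [0.1, 1.1]]

result = punctuate(position_scores, embeddings, prototypes)
print(result)  # {1: 'COMMA', 3: 'PERIOD'}
```

Because the classifier only ever sees positions that passed the extractor, it needs prototypes for punctuation classes alone, which is one way to read the abstract's claim about reducing interference from non-punctuation labels.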

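Keywords: intelligent collation of ancient books; span-extracted prototypical network; prompt learning; automatic sentence segmentation and punctuation; deep learning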

CLC number: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
