Authors: YANG Dongju [1,2]; HUANG Juntao [1]
Affiliations: [1] School of Information, North China University of Technology, Beijing 100144, China; [2] Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100144, China
Source: Computer Engineering, 2024, No. 9, pp. 113-120 (8 pages)
Funding: Key Program of the National Natural Science Foundation of China (61832004); Guangzhou Science and Technology Plan Project, Key R&D Program (202206030009).
Abstract: High-quality annotated data are crucial for Natural Language Processing (NLP) tasks in the field of Chinese scientific literature. To address the lack of high-quality annotated corpora and the inconsistency and inefficiency of manual annotation in this field, an annotation method based on a Large Language Model (LLM) is proposed. First, a fine-grained annotation specification suitable for multi-domain Chinese scientific literature is established, clarifying the entity types and the annotation granularity. Second, a structured text annotation prompt template and a generation parser are designed: the annotation task is framed as a single-stage, single-round question-answering process in which the annotation specification and the text to be annotated fill the corresponding slots of the template to construct the task prompt. The prompt is then fed to the LLM, which generates output text containing the annotation information, and the parser converts this output into structured annotation data. Finally, prompt learning with the LLM is used to generate the Annotated Chinese Scientific Literature (ACSL) entity dataset, which contains 10,000 annotated documents and 72,536 annotated entities across 48 disciplines, and three baseline models based on RoBERTa-wwm-ext, a whole-word-masking configuration of the Robustly optimized BERT (RoBERTa) approach, are proposed on ACSL. Experimental results show that the BERT+Span model performs best on long-span entity recognition in Chinese scientific literature, achieving an F1 value of 0.335. These results can serve as benchmarks for future research.
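The prompt-slot-filling and parsing workflow summarized in the abstract can be sketched as follows. This is a minimal illustration under assumed names (the template wording, the `[[type|mention]]` markup, and both function names are hypothetical, not the authors' actual code): the annotation specification and the raw text fill slots in a template, the LLM returns marked-up text, and a parser recovers structured entities from it.

```python
import re

# Hypothetical single-round prompt template with two slots: the annotation
# specification (guidelines) and the text to be annotated.
PROMPT_TEMPLATE = (
    "You are an annotator for Chinese scientific literature.\n"
    "Guidelines: {guidelines}\n"
    "Mark every entity in the text as [[type|mention]].\n"
    "Text: {text}\n"
)

def build_prompt(guidelines: str, text: str) -> str:
    """Fill the guideline and text slots to construct the task prompt."""
    return PROMPT_TEMPLATE.format(guidelines=guidelines, text=text)

def parse_annotations(llm_output: str) -> list[dict]:
    """Parse [[type|mention]] markup in the model output into structured records."""
    return [
        {"type": m.group(1), "mention": m.group(2)}
        for m in re.finditer(r"\[\[(\w+)\|(.+?)\]\]", llm_output)
    ]

# Example with a fabricated model reply (no real LLM call is made here):
reply = "We study [[method|prompt learning]] with [[model|RoBERTa-wwm-ext]]."
print(parse_annotations(reply))
```

In the paper's pipeline the parser output would then be accumulated across documents to build the ACSL dataset; here the single-round design means one prompt and one reply suffice per document, with no multi-turn dialogue state to manage.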
Keywords: text annotation method; Chinese scientific literature; large language model; prompt learning; information extraction
CLC number: TP391 [Automation and Computer Technology: Computer Application Technology]