Authors: YANG Dongju [1,2]; HUANG Juntao [1]
Affiliations: [1] School of Information, North China University of Technology, Beijing 100144, China; [2] Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100144, China
Source: Computer Engineering, 2024, No. 9, pp. 113-120 (8 pages)
Funding: Key Program of the National Natural Science Foundation of China (61832004); Guangzhou Science and Technology Plan Project, Key R&D Program (202206030009).
Abstract: High-quality annotated data are crucial for Natural Language Processing (NLP) tasks in the field of Chinese scientific literature. To address the lack of high-quality annotated corpora and the inconsistency and inefficiency of manual annotation in this field, an annotation method based on a Large Language Model (LLM) is proposed. First, a fine-grained annotation specification suitable for multi-domain Chinese scientific literature is established, clarifying the entity types and the annotation granularity. Second, a structured text annotation prompt template and a generation parser are designed: the annotation task is framed as a single-stage, single-round question-answering process in which the annotation specification and the text to be annotated fill the corresponding slots of the template to construct the task prompt. The prompt is then fed to the LLM, which generates output text containing the annotation information, and the parser converts this output into structured annotation data. Finally, prompt learning with the LLM is used to generate the Annotated Chinese Scientific Literature (ACSL) entity dataset, which contains 10,000 annotated documents and 72,536 annotated entities across 48 disciplines, and three baseline models based on RoBERTa-wwm-ext, a whole-word-masking configuration of the Robustly optimized BERT (RoBERTa) approach, are proposed on ACSL. Experimental results show that the BERT+Span model performs best on long-span entity recognition in Chinese scientific literature, achieving an F1 value of 0.335. These results can serve as benchmarks for future research.
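The prompt-slot-filling and parsing workflow summarized in the abstract can be sketched as follows. This is a minimal illustration under assumed names (the template wording, the `[[type|mention]]` markup, and both function names are hypothetical, not the authors' actual code): the annotation specification and the raw text fill slots in a template, the LLM returns marked-up text, and a parser recovers structured entities from it.

```python
import re

# Hypothetical single-round prompt template with two slots: the annotation
# specification (guidelines) and the text to be annotated.
PROMPT_TEMPLATE = (
    "You are an annotator for Chinese scientific literature.\n"
    "Guidelines: {guidelines}\n"
    "Mark every entity in the text as [[type|mention]].\n"
    "Text: {text}\n"
)

def build_prompt(guidelines: str, text: str) -> str:
    """Fill the guideline and text slots to construct the task prompt."""
    return PROMPT_TEMPLATE.format(guidelines=guidelines, text=text)

def parse_annotations(llm_output: str) -> list[dict]:
    """Parse [[type|mention]] markup in the model output into structured records."""
    return [
        {"type": m.group(1), "mention": m.group(2)}
        for m in re.finditer(r"\[\[(\w+)\|(.+?)\]\]", llm_output)
    ]

# Example with a fabricated model reply (no real LLM call is made here):
reply = "We study [[method|prompt learning]] with [[model|RoBERTa-wwm-ext]]."
print(parse_annotations(reply))
```

In the paper's pipeline the parser output would then be accumulated across documents to build the ACSL dataset; here the single-round design means one prompt and one reply suffice per document, with no multi-turn dialogue state to manage.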
Keywords: text annotation method; Chinese scientific literature; large language model; prompt learning; information extraction
CLC number: TP391 [Automation and Computer Technology: Computer Application Technology]