Affiliations: [1] Research Institute of Forest Resource Information Techniques, Chinese Academy of Forestry / Key Laboratory of Forestry Remote Sensing and Information Technology, National Forestry and Grassland Administration, Beijing 100091, China; [2] College of Forestry, Beijing Forestry University, Beijing 100083, China; [3] University of Hawaii at Manoa, Honolulu, HI 96822, USA
Source: Scientia Silvae Sinicae (林业科学), 2024, No. 9, pp. 99-110 (12 pages)
Funding: National Key Research and Development Program of China (2022YFE0128100).
Abstract: 【Objective】To address the problems of low utilization of forestry text, insufficient understanding of forestry knowledge by general-domain pre-trained language models, and the time-consuming, labor-intensive nature of manual data annotation, this study draws on a large volume of forestry text to propose a pre-trained language model that integrates forestry domain knowledge and, by automatically annotating the training data, efficiently realizes forestry extractive question answering, so as to provide intelligent information services for forestry decision-making and management.【Method】First, a forestry corpus covering three topics (terminology, laws and regulations, and literature) was built with web crawler technology. This corpus was used to further pre-train the general-domain pre-trained language model BERT. Through self-supervised learning on the masked language model and next sentence prediction tasks, BERT effectively learned forestry semantic information, yielding ForestBERT, a pre-trained language model that captures the general features of forestry text. Next, the pre-trained language model mT5 was fine-tuned to label samples automatically; after manual correction, a forestry extractive question-answering dataset of 2280 samples across the three topics was constructed. On this dataset, six general-domain Chinese pre-trained language models (BERT, RoBERTa, MacBERT, PERT, ELECTRA, and LERT) and the ForestBERT model built in this study were trained and validated to establish the advantages of ForestBERT. To examine the effect of topic on model performance, all models were also fine-tuned separately on the forestry terminology, forestry laws and regulations, and forestry literature subsets. Finally, the question-answering results of ForestBERT and BERT on forestry literature were compared visually to show the advantages of ForestBERT more intuitively.【Result】ForestBERT outperformed the other six comparison models overall on the forestry extractive question-answering task. Compared with the base model BERT, its exact match (EM) and F1 scores improved by 1.6% and 1.72%, respectively, and it also exceeded the average performance of the other five models by 0.96%. Under each model's optimal data-split ratio, ForestBERT surpassed BERT by 2.12% and the other five models by 1.2% in EM, and by 1.88% and 1.26% in F1, respectively. ForestBERT also performed well on all three forestry topics: its evaluation scores on the terminology, laws and regulations, and literature tasks exceeded the averages of the other six models by 3.06%, 1.73%, and 2.76%, respectively. Across all models, performance was best on the terminology task, with an average F1 of 87.63%, whereas the laws and regulations task, which performed less well, …
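The continued pre-training step described in the abstract (self-supervised learning on a forestry corpus starting from a general-domain BERT checkpoint) can be illustrated with a minimal sketch built on the Hugging Face transformers library. This is not the authors' implementation: the corpus file name, base checkpoint, and hyperparameters are hypothetical placeholders, and only the masked language model objective is shown here, while the paper additionally uses the next sentence prediction task.

```python
# Minimal sketch of domain-adaptive continued pre-training (MLM objective only),
# assuming a plain-text forestry corpus with one sentence per line.
# Not the authors' code; file names and hyperparameters are illustrative.
from transformers import (
    BertTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

# Start from a general-domain Chinese BERT checkpoint.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# "forestry_corpus.txt" is a hypothetical file: one forestry sentence per line.
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="forestry_corpus.txt",
    block_size=512,
)

# Standard BERT masking: 15% of tokens are randomly masked for the MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="forestbert-mlm",        # illustrative output directory
    num_train_epochs=3,                 # illustrative hyperparameters
    per_device_train_batch_size=16,
    save_steps=10_000,
)

Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=dataset,
).train()
```

Fine-tuning the resulting checkpoint for extractive question answering would then typically follow the standard SQuAD-style procedure (predicting answer start and end positions within a passage), which is also the setting in which EM and F1 scores such as those reported in the abstract are usually computed.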
Keywords: forestry text; BERT; pre-trained language model; domain-specific pre-training; extractive question answering; natural language processing
CLC number: TP391.1 [Automation and Computer Technology - Computer Application Technology]