A Classification Method of Agricultural News Text Based on BERT and Deep Active Learning

Cited by: 1

Authors: SHI Yunlai (石运来); CUI Yunpeng (崔运鹏) [1]; DU Zhigang (杜志钢)

Affiliations: [1] Agricultural Information Institute of CAAS, Beijing 100081; [2] Zibo Digital Agricultural Rural Development Center, Zibo 255000

Source: Journal of Library and Information Science in Agriculture (《农业图书情报学报》), 2022, No. 8, pp. 19-29 (11 pages)

Funding: National Science and Technology Library (NSTL) special literature task (2021XM45).

Abstract: [Purpose/Significance] Model training in current agricultural news classification research mostly follows a passive learning approach. Data often cannot be labeled promptly and labeling costs are too high, which has hindered agricultural news analysis. To address this, active learning or deep active learning is used to select the more valuable and representative items from unlabeled data for manual labeling and to build a labeled dataset, improving the efficiency and effectiveness of agricultural news mining. [Method/Process] Machine learning models commonly used for text classification were combined with active learning to analyze the improvement, and the BERT model was combined with three sampling strategies for deep active learning training. On a crawled news corpus of 19,847 samples, with the goal of filtering out agriculture-related news, the methods were tested in iterative experiments that added 30 labeled samples per round. [Results/Conclusions] The experiments show that applying active learning clearly improved the training process of every model. The BERT model paired with the discriminative active learning sampling function achieved the best news text classification performance with the lowest labeled-data requirement.

[Purpose/Significance] At present, most models used in news classification research are trained by passive learning. These models share common problems: data cannot be labeled immediately and the labeling cost is too high, which also hinders the analysis of agricultural news. With the explosive growth of news data in the network era, it has become even harder to label data, train supervised text classification models, and screen agriculture-related news from diversified online news sources. To solve this problem, pool-based active learning, the most commonly used form, or deep active learning is applied to select more valuable and representative data from the unlabeled data for manual labeling and to construct labeled datasets, improving the efficiency and effectiveness of news classification and agricultural news mining. [Method/Process] Machine learning models commonly used for text classification, namely a random forest classifier, a multinomial naive Bayes classifier and a logistic regression classifier, were combined with least-confidence active learning to analyze the effect, and the BERT model was combined with three sampling strategies (discriminative active learning, deep Bayesian active learning and least confidence) for deep active learning training. The news corpus of 19,847 samples was crawled from Sina and other news websites and then cleaned. With the aim of screening agriculture-related news from news samples on diverse topics, iterative experiments that added 30 labeled samples per round were run to check how the F1 score of each method combination improved as the number of annotations increased. In addition, the representativeness and diversity of the samples selected by each sampling function in the deep active learning of the BERT model were compared, so as to understand the characteristics of each strategy and provide inspiration for the selection and improvement of AL strat
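The pool-based procedure the abstract describes (score the unlabeled pool, send the least confident items for manual labeling, retrain, repeat with 30 new labels per round) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the `fit`, `predict_proba` and `oracle` callables, and the seed set are all hypothetical, and any classifier (including a BERT fine-tuning routine) would have to be supplied by the caller.

```python
from typing import Any, Callable, Dict, List, Sequence


def least_confidence_select(probabilities: Sequence[Sequence[float]],
                            batch_size: int = 30) -> List[int]:
    """Return positions of the samples whose top class probability is lowest.

    Hypothetical helper: a low maximum probability means the model is least
    sure about that sample, so it is queried first.
    """
    order = sorted(range(len(probabilities)),
                   key=lambda i: max(probabilities[i]))
    return order[:batch_size]


def active_learning_loop(pool: Sequence[Any],
                         oracle: Callable[[Any], int],
                         fit: Callable[[List[Any], List[int]], Any],
                         predict_proba: Callable[[Any, List[Any]], List[List[float]]],
                         seed_indices: Sequence[int],
                         rounds: int,
                         batch_size: int = 30) -> Dict[int, int]:
    """Grow a labeled set by `batch_size` items per round (assumed setup).

    `oracle` stands in for the human annotator; `fit` and `predict_proba`
    are caller-supplied training and scoring functions.
    """
    labeled = {i: oracle(pool[i]) for i in seed_indices}
    for _ in range(rounds):
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Retrain on everything labeled so far, then score the remaining pool.
        model = fit([pool[i] for i in labeled], [labeled[i] for i in labeled])
        probs = predict_proba(model, [pool[i] for i in unlabeled])
        for pos in least_confidence_select(probs, batch_size):
            idx = unlabeled[pos]
            labeled[idx] = oracle(pool[idx])
    return labeled
```

Swapping `least_confidence_select` for another scoring function is the only change needed to try a different sampling strategy; the discriminative and deep Bayesian strategies named in the abstract are not implemented here.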

Keywords: deep learning; agricultural news; text classification; BERT model; active learning

Classification number: TP391.1 [Automation and Computer Technology - Computer Application Technology]
