A Classification Method of Agricultural News Text Based on BERT and Deep Active Learning  Cited by: 1


Authors: SHI Yunlai, CUI Yunpeng[1], DU Zhigang (Agricultural Information Institute of CAAS, Beijing 100081; Zibo Digital Agricultural Rural Development Center, Zibo 255000)

Affiliations: [1] Agricultural Information Institute of CAAS, Beijing 100081 [2] Zibo Digital Agricultural Rural Development Center, Zibo 255000

Source: Journal of Library and Information Science in Agriculture, 2022, No. 8, pp. 19-29 (11 pages)

Funding: National Science and Technology Library (NSTL) special literature task (2021XM45).

Abstract: [Purpose/Significance] Model training in current agricultural news classification research mostly relies on passive learning, where data often cannot be labeled promptly and labeling costs are too high, which hinders agricultural news analysis. To address this, active learning or deep active learning is used to select the more valuable and representative items from unlabeled data for manual labeling and to build a labeled dataset, improving the efficiency and effectiveness of agricultural news mining. [Method/Process] Machine learning models commonly used for text classification were combined with active learning to analyze the improvement, and the BERT model was combined with three sampling strategies for deep active learning training. On a crawled news corpus of 19,847 samples, with the goal of screening out agriculture-related news, the methods were tested in iterative experiments that add 30 labeled samples per round. [Results/Conclusions] The experiments show that active learning clearly improves the training process of every model. The BERT model combined with the discriminative active learning sampling function achieves the best news text classification performance and requires the least labeled data.

[Purpose/Significance] At present, most models used in news classification research are trained without active learning. Common problems with this approach are that data cannot be labeled immediately and that labeling costs are too high, which also hinders the analysis of agricultural news. With the explosive growth of news data in the network era, it has become even more difficult to label data, train supervised text classification models, and screen agriculture-related news from diversified online news sources. To solve this problem, the most commonly used pool-based active learning or deep active learning techniques are used to select more valuable and representative data from unlabeled data for manual labeling, and to construct labeled datasets that improve the efficiency and effectiveness of news classification and agricultural news mining. [Method/Process] Commonly used machine learning models for text classification, such as a random forest classifier, a multinomial naive Bayes classifier, and a logistic regression classifier, were combined with least-confidence active learning to analyze the effect, and the BERT model was combined with three sampling strategies (discriminative active learning, deep Bayesian active learning, and least confidence) for deep active learning training. On a news corpus of 19,847 samples crawled from Sina and other news websites and then cleaned, with the aim of screening agriculture-related news from news samples on various topics, iterative experiments adding 30 labeled samples per round were run to check how the F1 score improves under each method combination as the number of annotations grows. In addition, the representativeness and diversity of the samples selected by each sampling function in deep active learning with the BERT model were compared, so as to understand the characteristics of each strategy and provide inspiration for the selection and improvement of AL strat
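The pool-based loop with least-confidence sampling described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the nearest-centroid classifier is a toy stand-in for the models named in the abstract, the data are synthetic two-dimensional points rather than news text, and every function name here (`fit_centroids`, `predict_proba`, `least_confidence_query`, `active_learning_loop`) is hypothetical. Only the batch size of 30 labeled samples per round comes from the abstract.

```python
import math
import random


def fit_centroids(X, y):
    """Toy stand-in model: one mean vector per class (not the paper's model)."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def predict_proba(centroids, x):
    """Softmax over negative squared distances to each class centroid."""
    classes = sorted(centroids)
    scores = [-sum((a - b) ** 2 for a, b in zip(x, centroids[c])) for c in classes]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def least_confidence_query(probabilities, batch_size):
    """Positions of the batch_size rows whose highest class probability is lowest."""
    confidence = [max(p) for p in probabilities]
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return ranked[:batch_size]


def active_learning_loop(pool_X, oracle, seed_idx, rounds, batch_size=30):
    """Pool-based active learning: retrain, query the least confident items, label them."""
    labels = {i: oracle(i) for i in seed_idx}
    labeled = list(seed_idx)
    for _ in range(rounds):
        model = fit_centroids([pool_X[i] for i in labeled],
                              [labels[i] for i in labeled])
        unlabeled = [i for i in range(len(pool_X)) if i not in labels]
        if not unlabeled:
            break
        probs = [predict_proba(model, pool_X[i]) for i in unlabeled]
        for position in least_confidence_query(probs, batch_size):
            index = unlabeled[position]
            labels[index] = oracle(index)  # simulated human annotation
            labeled.append(index)
    return labeled, labels


# Synthetic two-class pool; the oracle simply looks up the hidden true label.
random.seed(0)
pool_X = ([[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
          + [[random.gauss(3, 1), random.gauss(3, 1)] for _ in range(200)])
true_y = [0] * 200 + [1] * 200
labeled, labels = active_learning_loop(
    pool_X, lambda i: true_y[i], seed_idx=[0, 200], rounds=3, batch_size=30)
print(len(labeled))  # 2 seed labels + 3 rounds x 30 queries = 92
```

In a setup closer to the one the abstract describes, the stand-in model would be replaced by one of the named classifiers or by BERT, and the F1 score on a held-out set would be recorded after each round to trace how it grows with the number of annotations.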

Keywords: deep learning; agricultural news; text classification; BERT model; active learning

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
