Authors: 石运来 (SHI Yunlai); 崔运鹏 (CUI Yunpeng)[1]; 杜志钢 (DU Zhigang) (Agricultural Information Institute of CAAS, Beijing 100081; Zibo Digital Agricultural Rural Development Center, Zibo 255000)
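Affiliations: [1] Agricultural Information Institute, Chinese Academy of Agricultural Sciences (CAAS), Beijing 100081; [2] Zibo Digital Agricultural Rural Development Center, Zibo 255000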
Source: Journal of Library and Information Science in Agriculture (《农业图书情报学报》), 2022, No. 8, pp. 19-29 (11 pages)
Funding: National Science and Technology Library (NSTL) special literature task (2021XM45).
Abstract: [Purpose/Significance] Model training in current agricultural news classification research mostly relies on passive learning, which commonly suffers from data that cannot be labeled promptly and from excessive labeling cost, and this also hinders agricultural news analysis. To address the problem, active learning or deep active learning is used to select the more valuable and representative items from unlabeled data for manual labeling and to build a labeled dataset, improving the efficiency and effectiveness of agricultural news mining. [Method/Process] Machine learning models commonly used for text classification were combined with active learning to analyze the improvement, and the BERT model was combined with three sampling strategies for deep active learning training. On a crawled news corpus of 19,847 samples, with the goal of filtering out agriculture-related news, the methods were tested in iterative experiments that added 30 labeled samples per round. [Results/Conclusions] The experiments show that applying active learning clearly improves the training process of every model. Among them, the BERT model paired with the discriminative active learning sampling function achieved the best news text classification performance and required the least labeled data.
Extended English abstract: [Purpose/Significance] At present, most models used in news classification research are trained by passive (non-active) learning. These models share common problems: data cannot be labeled immediately and the labeling cost is too high, which also hinders the analysis of agricultural news. With the explosive growth of news data in the network era in particular, it has become more difficult to label data, train supervised text classification models, and screen agriculture-related news from diversified online news sources. To solve this problem, pool-based active learning, the most commonly used form, or deep active learning is used to select more valuable and representative data from unlabeled data for manual labeling and to construct labeled datasets, improving the efficiency and effectiveness of news classification and agricultural news mining. [Method/Process] Commonly used machine learning models for text classification, such as a random forest classifier, a multinomial naive Bayes classifier and a logistic regression classifier, were combined with least-confidence active learning to analyze the effect, and the BERT model was combined with three sampling strategies (discriminative active learning, deep Bayesian active learning and least confidence) for deep active learning training. On a news corpus of 19,847 samples crawled from Sina and other news websites and then cleaned, with the aim of screening agriculture-related news from samples covering various topics, iterative experiments adding 30 samples per round were run to check how the F1 score under each combination of methods improves as the number of annotations increases. In addition, the representativeness and diversity of the samples selected by each sampling function in deep active learning with the BERT model were compared, so as to understand the characteristics of each strategy and provide inspiration for the selection and improvement of AL strat
Keywords: deep learning; agricultural news; text classification; BERT model; active learning
Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]