The Semi-Automatic Classification Data Labeling Method Based on Dispute About Weak Label

Cited by: 1

Authors: LI Zi-qiang; YANG Wei [2]; YANG Xian-feng [2]; LUO Lin

Affiliations: [1] College of Movie and Media, Sichuan Normal University, Chengdu, Sichuan 610066, China; [2] School of Computer Science and Software Engineering, Southwest Petroleum University, Chengdu, Sichuan 610500, China; [3] Chengdu R&D Center, Tellhow Software, Chengdu, Sichuan 610041, China

Source: Acta Electronica Sinica (《电子学报》), 2024, No. 8, pp. 2891-2899 (9 pages)

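Funding: National Natural Science Foundation of China (No. 61802321); Key R&D Program of the Sichuan Provincial Department of Science and Technology (No. 2020YFN0019).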

Abstract: Deep active learning (DAL) has been applied successfully to classification data labeling, but selecting the samples that most improve model performance remains difficult. This paper proposes a semi-automatic classification data labeling method based on weak-label dispute (Dispute about Weak Label based Deep Active Learning, DWLDAL), which iteratively selects the samples that the model finds hard to distinguish and passes them to human annotators for accurate labeling. The method comprises a pseudo-label generator and weak-label generators. The pseudo-label generator is trained on the accurately labeled dataset and produces pseudo labels for the unlabeled data; each weak-label generator is trained on a random subset of the pseudo-labeled data. A committee of weak-label generators decides which unlabeled samples are the most disputed, and those samples are sent for manual labeling. For the text classification problem, experiments were run on the public datasets IMDB (Internet Movie Database), 20NEWS (20 Newsgroups), and chnsenticorp (chnsenticorp_htl_all). Three voting decision schemes are evaluated from two angles: the accuracy of data labeling and the accuracy of the classification task. On the three datasets, the data-labeling F1 score of DWLDAL is 30.22%, 14.07%, and 2.57% higher than that of the existing method Snuba, and the classification F1 score of DWLDAL is 1.01%, 22.72%, and 4.83% higher than that of Snuba, respectively.
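The committee step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which three voting schemes the paper evaluates, so the example below assumes vote entropy, a common query-by-committee disagreement measure, as a stand-in. The function names `vote_entropy` and `select_disputed` and the sample votes are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the label distribution among committee votes for one sample.

    Assumption: vote entropy is used here only as an illustrative
    disagreement score; the paper's own voting schemes may differ.
    """
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def select_disputed(committee_votes, k):
    """Return the indices of the k samples with the highest vote entropy.

    committee_votes[i] holds the weak labels assigned to unlabeled sample i,
    one label per weak-label generator in the committee.
    """
    scores = [vote_entropy(v) for v in committee_votes]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical weak labels from a committee of three generators.
votes = [
    ["pos", "pos", "pos"],  # unanimous
    ["pos", "neg", "pos"],  # two-to-one split
    ["neg", "neg", "neg"],  # unanimous
    ["pos", "neg", "neu"],  # three-way split
]
disputed = select_disputed(votes, k=2)
```

In a full loop, the samples in `disputed` would go to a human annotator, the labeled set would grow, and the pseudo-label and weak-label generators would be retrained before the next round.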

Keywords: deep active learning; text classification; pseudo-label generator; weak-label generator; voting committee

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]
