Authors: LIN Chengyu (林呈宇); WANG Lei (王雷) [1]; XUE Cong (薛聪) [1]
Affiliations: [1] Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; [2] School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China
Source: Journal of Computer Applications (《计算机应用》), 2023, No. 2, pp. 335-342 (8 pages)
Funding: Key Program of the National Natural Science Foundation of China (U1636220).
Abstract: To address category vocabulary noise and label noise in weakly-supervised text classification, a weakly-supervised text classification model with label semantic enhancement is proposed. First, the category vocabulary is denoised based on the contextual semantic representations of words, yielding a highly accurate category vocabulary. Then, a word category prediction task based on the MASK mechanism is constructed to fine-tune the pre-trained model BERT (Bidirectional Encoder Representations from Transformers), so that it learns the relationship between words and categories. Finally, a self-training module that incorporates label semantics makes full use of all the data and reduces the impact of label noise, converting word-level semantics into sentence-level semantics and thereby accurately predicting the category of a text sequence. Experimental results show that, compared with LOTClass (Label-name-Only Text Classification), the current state-of-the-art weakly-supervised text classification model, the proposed method improves classification accuracy by 5.29, 1.41 and 1.86 percentage points on the public datasets THUCNews, AG News and IMDB, respectively.
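Keywords: weakly-supervised text classification; BERT; MASK mechanism; label semantics; label noise; self-training
Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]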
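Note: The abstract does not spell out how the self-training module updates its targets. As a rough illustration only, the sketch below shows a soft-label sharpening rule that is common in self-training approaches: square each predicted probability, divide by the soft class frequency, and renormalise. The function name `sharpen_targets` and the toy probabilities are assumptions made for this example, and the rule is not claimed to be the one used in the paper, which additionally introduces label semantics.

```python
def sharpen_targets(probs):
    """Return sharpened soft targets for self-training.

    probs: list of rows, one per document, each a list of class
    probabilities that sums to 1. (Hypothetical input format.)
    """
    n_classes = len(probs[0])
    # Soft class frequency f_j = sum over documents of p_ij.
    freq = [sum(row[j] for row in probs) for j in range(n_classes)]
    targets = []
    for row in probs:
        # Squaring favours confident predictions; dividing by f_j
        # keeps large classes from dominating the targets.
        weighted = [row[j] ** 2 / freq[j] for j in range(n_classes)]
        total = sum(weighted)
        targets.append([w / total for w in weighted])
    return targets


if __name__ == "__main__":
    toy_probs = [[0.6, 0.4], [0.3, 0.7]]  # assumed model outputs
    for before, after in zip(toy_probs, sharpen_targets(toy_probs)):
        print(before, "->", [round(x, 3) for x in after])
```

In a self-training loop, the classifier would then be trained to match these sharpened targets, and the targets would be recomputed periodically from the updated predictions.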