Authors: CHEN Zhongtao (陈中涛); ZHOU Yatong (周亚同)
Affiliation: [1] School of Electronic and Information Engineering, Hebei University of Technology, Tianjin 300401, China
Source: Computer Engineering and Applications (《计算机工程与应用》), 2025, No. 4, pp. 192-210 (19 pages)
Abstract: Most current seed-word-based weakly supervised text classification algorithms must search the dataset for every seed word and use the matches to expand a category dictionary, and seed words that occur infrequently have correspondingly weak category recognition ability. To address this, a simple and effective weakly supervised Chinese text classification algorithm (SEWClass) is designed. The method uses the initial weights of a pre-trained language model to generate an abstract understanding of the text, and on that basis generates abstract constraints and concrete constraints, which are used to construct the pseudo-labeled data for the first round of training. A dimensionality reduction model and a classifier are built jointly according to the number of categories, to suit two characteristics of weakly supervised text classification: the categories must be specified in advance, and the training data grows during self-training. Because of the two constraints, the pseudo-labeled data has high precision; during self-training only the dimensionality reduction model is trained, which improves recall and efficiency. SEWClass needs only one seed word per category, such as the category name, to complete the classification task, and its performance does not depend on whether the seed word occurs in the dataset. On two Chinese datasets, THUCNews and toutiao, SEWClass performs far better than other weakly supervised algorithms.
Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
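
The abstract describes a two-stage pattern: build high-precision pseudo-labels from one seed word per category, then grow the labeled set through self-training. The sketch below illustrates only that general pattern and is not the authors' SEWClass implementation. Everything in it is an assumption made for illustration: the toy 2-D vectors stand in for representations from a pre-trained language model, the two thresholds (`min_sim`, `min_margin`) stand in for the paper's abstract and concrete constraints, and a nearest-prototype rule stands in for the jointly built dimensionality reduction model and classifier.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def confident_label(vec, prototypes, min_sim, min_margin):
    """Return the best class only if it passes both thresholds, else None.

    The two thresholds are hypothetical stand-ins for the paper's two
    constraint types; they trade recall for pseudo-label precision.
    """
    scored = sorted(((cosine(vec, p), c) for c, p in prototypes.items()),
                    reverse=True)
    best_sim, best_cls = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else -1.0
    if best_sim >= min_sim and best_sim - runner_up >= min_margin:
        return best_cls
    return None


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def self_train(doc_vecs, seed_vecs, min_sim=0.8, min_margin=0.3, rounds=5):
    """Pseudo-label documents from seed vectors, then iterate.

    Each round labels the still-unlabeled documents that pass both
    thresholds, then recomputes every class prototype from its seed vector
    plus its pseudo-labeled documents. Stops when a round adds nothing.
    """
    labels = {}
    prototypes = dict(seed_vecs)
    for _ in range(rounds):
        added = False
        for i, vec in enumerate(doc_vecs):
            if i in labels:
                continue
            cls = confident_label(vec, prototypes, min_sim, min_margin)
            if cls is not None:
                labels[i] = cls
                added = True
        if not added:
            break
        for cls, seed in seed_vecs.items():
            members = [doc_vecs[i] for i, c in labels.items() if c == cls]
            prototypes[cls] = centroid([seed] + members)
    return labels


if __name__ == "__main__":
    # Hypothetical toy data: one seed vector per category, four documents.
    seeds = {"sports": [1.0, 0.0], "finance": [0.0, 1.0]}
    docs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.4], [0.5, 0.5]]
    print(self_train(docs, seeds))
```

In this toy setup an ambiguous document such as `[0.5, 0.5]` is left unlabeled rather than forced into a class, which is the precision-first behavior the abstract attributes to its pseudo-labeling stage.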