Simple and Effective Weakly Supervised Chinese Text Classification Algorithm (简单且有效的弱监督中文文本分类算法)


Authors: CHEN Zhongtao (陈中涛); ZHOU Yatong (周亚同) (School of Electronic and Information Engineering, Hebei University of Technology, Tianjin 300401, China)

Affiliation: [1] School of Electronic and Information Engineering, Hebei University of Technology, Tianjin 300401, China

Source: Computer Engineering and Applications (《计算机工程与应用》), 2025, No. 4, pp. 192-210 (19 pages)

Abstract: Most current seed-word-based weakly supervised text classification algorithms must search the dataset for every seed word and use the matches to expand a category dictionary, and seed words that occur infrequently have correspondingly weak category recognition ability. To address this, a simple and effective weakly supervised Chinese text classification algorithm (SEWClass) is designed. The method uses the initial weights of a pre-trained language model to generate an abstract understanding of the text and, on that basis, derives abstract constraints and concrete constraints to construct the pseudo-labeled data for the first round of training. A dimensionality reduction model and a classifier are jointly constructed according to the number of categories, to suit two characteristics of weakly supervised text classification: the categories must be specified in advance, and training data must be added during self-training. With the two kinds of constraints, the pseudo-labeled data have high precision, and only the dimensionality reduction model is trained during self-training, which improves recall and efficiency. SEWClass needs only one seed word per category, such as the category name, to complete the classification task, and its performance does not depend on whether the seed words appear in the dataset. On two Chinese datasets, THUCNews and toutiao, SEWClass performs far better than other weakly supervised algorithms.
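The abstract does not say how the two constraints are computed, so the following is only a generic sketch of seed-based pseudo-labeling followed by self-training, not the SEWClass implementation. It assumes each category has one seed vector and each document has one vector; the toy 2-D vectors below are hand-made stand-ins for language-model embeddings. The two filters (a minimum similarity and a minimum margin over the runner-up class), every threshold, and the function names `pseudo_label` and `self_train` are hypothetical. The dimensionality reduction model and classifier described in the abstract are not modeled here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pseudo_label(doc_vecs, protos, min_sim, min_margin):
    """Label only documents that pass both (hypothetical) filters:
    best similarity >= min_sim, and best minus runner-up >= min_margin."""
    labels = {}
    for i, d in enumerate(doc_vecs):
        sims = sorted(((cosine(d, p), c) for c, p in enumerate(protos)),
                      reverse=True)
        best_sim, best_cls = sims[0]
        runner_up = sims[1][0] if len(sims) > 1 else -1.0
        if best_sim >= min_sim and best_sim - runner_up >= min_margin:
            labels[i] = best_cls
    return labels

def centroids(doc_vecs, labels, fallback):
    """Mean vector of the labeled documents per class; a class with no
    labeled documents keeps its fallback (seed) vector."""
    dim = len(doc_vecs[0])
    sums = [[0.0] * dim for _ in fallback]
    counts = [0] * len(fallback)
    for i, c in labels.items():
        counts[c] += 1
        for k in range(dim):
            sums[c][k] += doc_vecs[i][k]
    return [[x / counts[c] for x in sums[c]] if counts[c] else fallback[c]
            for c in range(len(fallback))]

def self_train(doc_vecs, seed_vecs, rounds=5,
               strict=(0.9, 0.2), relaxed=(0.8, 0.2)):
    """Start from strict seed-based labels, then repeatedly rebuild class
    prototypes from the labeled set and admit new documents under a
    relaxed similarity threshold. Earlier labels are never overwritten."""
    labels = pseudo_label(doc_vecs, seed_vecs, *strict)
    for _ in range(rounds):
        protos = centroids(doc_vecs, labels, seed_vecs)
        merged = {**pseudo_label(doc_vecs, protos, *relaxed), **labels}
        if merged == labels:  # nothing new was admitted; stop early
            break
        labels = merged
    return labels

# Toy data: two categories, four documents (the last one is ambiguous).
seeds = [[1.0, 0.0], [0.0, 1.0]]
docs = [[0.95, 0.05], [0.1, 0.9], [0.7, 0.4], [0.5, 0.5]]
initial = pseudo_label(docs, seeds, 0.9, 0.2)
final = self_train(docs, seeds)
```

In this toy run the strict first pass labels only the two clear-cut documents, self-training then admits the third, and the ambiguous fourth stays unlabeled. That mirrors the precision-first, recall-later pattern the abstract describes, though only in spirit.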

Keywords: weak supervision; text classification; self-training; seed words

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
