Integrating posterior probability calibration training into text classification algorithm


Authors: JIANG Jing; CHEN Yu; SUN Jieping [1]; JU Shenggen [1]

Affiliations: [1] College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China; [2] College of Science and Technology, Sichuan Minzu College, Kangding, Sichuan 626001, China

Source: Journal of Computer Applications, 2022, No. 6, pp. 1789-1795 (7 pages)

Funding: National Natural Science Foundation of China (61972270); Sichuan Province Key Research and Development Project (2019YFG0521).

Abstract: Pre-trained language models used for text representation have achieved high accuracy on a variety of text classification tasks, but two problems remain. First, after computing the posterior probabilities of all categories, a pre-trained language model selects the category with the largest posterior probability as its final classification result; in many scenarios, however, the quality of the posterior probabilities provides more reliable information than the classification result itself. Second, the classifier of a pre-trained language model suffers performance degradation when it must assign different labels to semantically similar texts. To address these two problems, a model named PosCal-negative, which combines posterior probability calibration with negative supervision, was proposed. During training, the model dynamically penalizes the difference between the predicted probability and the empirical posterior probability in an end-to-end manner, and uses texts with different labels to apply negative supervision to the encoder, so that a distinct feature vector representation is generated for each category. Experimental results show that the PosCal-negative model reaches classification accuracies of 91.55% and 69.19% on two Chinese maternal and child care text classification datasets, MATINF-C-AGE and MATINF-C-TOPIC, respectively, which are 1.13 and 2.53 percentage points higher than those of the Enhanced Representation through kNowledge IntEgration (ERNIE) model.
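The abstract names two training signals but gives no formulas, so the following is only an illustrative sketch of how such penalties might look, not the authors' implementation. It assumes that the empirical posterior is estimated by equal-width histogram binning of the predicted probabilities, and that negative supervision is expressed as the mean cosine similarity between encoder vectors of differently labeled texts. The function names, the weights `lam_cal` and `lam_neg`, and the bin count are all hypothetical.

```python
import math


def calibration_penalty(probs, labels, num_bins=10):
    """Mean absolute gap between predicted and binned empirical posteriors.

    probs  : list of per-sample probability lists, one entry per class
    labels : list of integer gold labels
    Assumption: the empirical posterior is estimated by histogram binning.
    """
    num_classes = len(probs[0])
    total = 0.0
    for k in range(num_classes):
        # Assign each sample to an equal-width bin by its predicted p(y=k|x).
        bins = [min(int(p[k] * num_bins), num_bins - 1) for p in probs]
        count = [0] * num_bins
        hits = [0] * num_bins
        for b, y in zip(bins, labels):
            count[b] += 1
            hits[b] += int(y == k)
        # Empirical posterior of a sample: share of its bin whose label is k.
        for b, p in zip(bins, probs):
            total += abs(p[k] - hits[b] / count[b])
    return total / (len(probs) * num_classes)


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def negative_supervision_penalty(vectors, labels):
    """Mean cosine similarity over all pairs of samples with different labels.

    Assumption: pushing this value down separates the encoder
    representations of texts that carry different labels.
    """
    sims = [
        cosine(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
        if labels[i] != labels[j]
    ]
    return sum(sims) / len(sims) if sims else 0.0


def total_loss(task_loss, probs, vectors, labels, lam_cal=1.0, lam_neg=1.0):
    """Task loss plus the two weighted penalties (weights are hypothetical)."""
    return (task_loss
            + lam_cal * calibration_penalty(probs, labels)
            + lam_neg * negative_supervision_penalty(vectors, labels))
```

In an actual training loop these quantities would be computed on tensors with automatic differentiation, and the bin statistics would be refreshed as the model's predictions change. The pure-Python version here only makes the arithmetic explicit.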

Keywords: text classification; posterior probability calibration; pre-trained language model; negative supervision; deep learning

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
