A Two-Stage Knowledge Distillation Text Classification Method Incorporating Group Assistant Models


Authors: ZHANG Junqiang (张骏强), GAO Shangbing (高尚兵), SU Rui (苏睿), LI Wenting (李文婷) (School of Computer and Software Engineering, Huaiyin Institute of Technology, Huai'an 223003, China; Laboratory for Internet of Things and Mobile Internet Technology of Jiangsu Province, Huai'an 223001, China)

Affiliations: [1] School of Computer and Software Engineering, Huaiyin Institute of Technology, Huai'an, Jiangsu 223003, China; [2] Laboratory for Internet of Things and Mobile Internet Technology of Jiangsu Province, Huai'an, Jiangsu 223001, China

Source: Journal of Changzhou University (Natural Science Edition), 2024, No. 6, pp. 71-82 (12 pages)

Funding: National Key R&D Program of China (2018YFB1004904); National Natural Science Foundation of China, General Program (62076107); Six Talent Peaks Project of Jiangsu Province (XYDXXJS-011).

Abstract: When pre-trained language models built on the Transformer architecture are used for text classification, the better-performing models suffer from large parameter counts, heavy training overhead, and high inference latency. This paper proposes a two-stage knowledge distillation text classification method incorporating group assistant models (GAM), in which the group assistant model consists of a graph convolution network assistant model (GCNAM) and a Transformer assistant model. The knowledge of the teacher model is distilled into the student model through the Transformer assistant model, while the GCN assistant model guides the two-stage distillation process. In addition, a progressive knowledge distillation strategy is proposed for distilling the intermediate layers, which adjusts the teacher layers to be distilled according to the model's knowledge distribution density. Experimental results on multiple datasets show that the proposed method outperforms the baseline methods in all cases, reducing the number of model parameters by 48.20% and increasing inference speed by 56.94% at the cost of at most a 0.73% drop in F1-score.
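To make the chained distillation pipeline described above concrete, the following is a minimal PyTorch sketch of the standard temperature-scaled knowledge distillation objective applied in two successive stages (teacher into assistant, then assistant into student). It illustrates only the generic soft-label KD recipe; the paper's GCN assistant guidance and progressive intermediate-layer selection are not reproduced here, and the model and loader names are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-label KD loss: KL divergence between temperature-softened
    teacher and student distributions, mixed with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def train_stage(frozen_model, learner, loader, optimizer, T=2.0, alpha=0.5):
    """One distillation stage: the frozen model provides soft targets,
    the learner is updated. Stage 1: teacher -> assistant;
    Stage 2: assistant -> student (models are hypothetical classifiers
    returning class logits)."""
    frozen_model.eval()
    learner.train()
    for inputs, labels in loader:
        with torch.no_grad():
            t_logits = frozen_model(inputs)
        s_logits = learner(inputs)
        loss = distillation_loss(s_logits, t_logits, labels, T, alpha)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```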

Keywords: text classification; pre-trained language model; two-stage knowledge distillation; group assistant models; progressive distillation

CLC number: TP391 [Automation and Computer Technology / Computer Application Technology]

 
