Research on Hierarchical Topic Analysis for a Course Based on BERT Embedding and Knowledge Distillation

Authors: GUO Zhendong; LIN Min; LI Chengcheng (College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot, Inner Mongolia 010022, China)

Affiliation: [1] College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot, Inner Mongolia 010022, China

Source: Journal of Chinese Information Processing, 2024, No. 7, pp. 84-94 (11 pages)

Funding: National Natural Science Foundation of China (61806103, 61562068); Natural Science Foundation of Inner Mongolia (2017MS0607, 2021LHMS06010); National 242 Information Security Program (2019A114).

Abstract: Tree-structured neural topic models based on the variational auto-encoder can effectively mine hierarchical semantic features of text, but existing models rely only on statistical features such as word frequency and ignore external prior knowledge that could aid topic discovery. For the task of course topic analysis, this paper draws on transfer learning and proposes a tree-structured neural topic model based on BERT embedding and knowledge distillation. First, a BERT-CRF word segmentation model is built, and a small amount of domain text is used to further train BERT so as to optimize its representation of domain characters; the further-trained BERT character embeddings are then dynamically fused into coarse-grained domain word embeddings, alleviating the mismatch between character-granularity BERT embeddings and the bag-of-words representation. Second, to address the data sparsity of the bag-of-words representation, a BERT autoencoder is constructed with document reconstruction as its objective, and the supervised document representation it learns is distilled to guide the topic model's document reconstruction, improving topic quality. Finally, the tree-structured neural topic model is optimized to fit the auxiliary-information-rich BERT word embeddings, while the supervised distilled knowledge guides the document reconstruction of the unsupervised topic model. Experiments show that the proposed model combines the strengths of pre-trained language models and topic models and summarizes course topics more effectively.
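To make the pipeline concrete, the following is a minimal PyTorch sketch of two ideas from the abstract: pooling character-level BERT embeddings into coarse-grained word embeddings over segmenter-produced spans, and a soft-label distillation loss through which a teacher document representation can guide the topic model's reconstruction. The function names, the mean-pooling choice, and the KL-based loss are illustrative assumptions, not the authors' published implementation (which uses a BERT-CRF segmenter and a BERT autoencoder teacher).

```python
# Sketch only: assumes word spans are already produced by a segmenter
# (the paper's BERT-CRF model is not reproduced here).
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def pool_word_embeddings(text: str,
                         word_spans: list[tuple[int, int]]) -> torch.Tensor:
    """Mean-pool character-level BERT embeddings over segmented word spans.

    `word_spans` are (start, end) character offsets from a word segmenter.
    Pooling yields one coarse-grained embedding per word, matching the
    granularity of the topic model's bag-of-words vocabulary.
    """
    enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0]          # (seq_len, 2) char offsets
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]   # (seq_len, hidden_dim)
    words = []
    for start, end in word_spans:
        # Sub-token positions whose char span lies inside the word;
        # `e > s` also filters out [CLS]/[SEP], which have (0, 0) offsets.
        idx = [i for i, (s, e) in enumerate(offsets.tolist())
               if s >= start and e <= end and e > s]
        words.append(hidden[idx].mean(dim=0))
    return torch.stack(words)                        # (n_words, hidden_dim)

def distillation_loss(student_doc_vec: torch.Tensor,
                      teacher_doc_vec: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-label KL divergence between a teacher document representation
    and the topic model's document representation (both are assumed to be
    projected to a shared dimension)."""
    teacher = F.softmax(teacher_doc_vec / temperature, dim=-1)
    student = F.log_softmax(student_doc_vec / temperature, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2
```

In the paper's setting, the teacher vector would come from the supervised BERT autoencoder and the student from the topic model's variational document encoder; the temperature scaling follows the standard knowledge-distillation recipe.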

Keywords: tree-structured neural topic model; BERT; knowledge distillation; variational auto-encoder

CLC Number: TP391 [Automation and Computer Technology - Computer Application Technology]