Authors: LIU Gang[1,2]; WANG Tongli[1,2]; TANG Hongwei; ZHAN Kai; YANG Wenli (College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China; Modeling and Emulation in E-Government National Engineering Laboratory, Harbin Engineering University, Harbin 150001, China; PwC Enterprise Digital, PricewaterhouseCoopers, Sydney 2070, Australia)
Affiliations: [1] College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China; [2] Modeling and Emulation in E-Government National Engineering Laboratory, Harbin Engineering University, Harbin 150001, China; [3] PwC Enterprise Digital, PricewaterhouseCoopers, Sydney 2070, Australia
Source: Computer Engineering and Applications, 2024, No. 1, pp. 154-164 (11 pages)
Funding: Higher Education Teaching Reform Research Project of Heilongjiang Province (SJGZ20200044); Natural Science Foundation of Heilongjiang Province (LH2021F015); National High-End Foreign Expert Recruitment Program (G2021180008L)
Abstract: Most current topic models are built on word co-occurrence information within the texts themselves and do not introduce topic sparsity constraints to improve topic extraction. In addition, short texts suffer from sparse word co-occurrence, which seriously degrades the accuracy of short-text topic modeling. To address these problems, an enhanced context neural topic model (ECNTM) is proposed. ECNTM imposes a sparsity constraint on topics through a topic controller, filtering out irrelevant topics, and the model input becomes the concatenation of a BOW vector and an SBERT sentence embedding. In the Gaussian decoder, the topic distribution over words is treated as a multivariate Gaussian distribution or a Gaussian mixture distribution in the embedding space, which explicitly enriches the limited context information of short texts and alleviates the sparsity of word co-occurrence features. Experimental results on four public datasets (WS, Reuters, KOS, and 20 NewsGroups) show that the model clearly outperforms the baseline models in perplexity, topic coherence, and text classification accuracy, demonstrating the effectiveness of introducing topic sparsity constraints and rich contextual information into short-text topic modeling.
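As an illustration of the architecture described in the abstract, the following is a minimal PyTorch sketch (not the authors' released implementation) of a VAE-style neural topic model: the encoder takes the concatenation of a BOW vector and an SBERT sentence embedding, a sigmoid gate stands in for the topic controller that sparsifies the document-topic distribution, and the decoder models each topic as a diagonal Gaussian over word embeddings. All layer names, dimensions, and hyperparameters below are assumptions made for illustration only.

# Illustrative sketch only; names, sizes, and the exact form of the topic controller are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECNTMSketch(nn.Module):
    def __init__(self, vocab_size, sbert_dim=384, n_topics=50, embed_dim=100, hidden=256):
        super().__init__()
        # Encoder input: concatenation of the BOW vector and the SBERT sentence embedding.
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size + sbert_dim, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
        )
        self.fc_mu = nn.Linear(hidden, n_topics)
        self.fc_logvar = nn.Linear(hidden, n_topics)
        # Assumed stand-in for the "topic controller": a per-document gate over topics.
        self.gate = nn.Linear(hidden, n_topics)
        # Gaussian decoder: each topic is a diagonal Gaussian in word-embedding space.
        self.word_emb = nn.Parameter(torch.randn(vocab_size, embed_dim) * 0.1)
        self.topic_mu = nn.Parameter(torch.randn(n_topics, embed_dim) * 0.1)
        self.topic_logvar = nn.Parameter(torch.zeros(n_topics, embed_dim))

    def topic_word_logits(self):
        # log N(e_w; mu_k, sigma_k^2) evaluated for every (topic k, word w) pair -> (K, V).
        diff = self.word_emb.unsqueeze(0) - self.topic_mu.unsqueeze(1)    # (K, V, D)
        var = self.topic_logvar.exp().unsqueeze(1)                        # (K, 1, D)
        return -0.5 * ((diff ** 2) / var + self.topic_logvar.unsqueeze(1)).sum(-1)

    def forward(self, bow, sbert):
        h = self.encoder(torch.cat([bow, sbert], dim=-1))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()              # reparameterization trick
        theta = F.softmax(z, dim=-1)                                      # document-topic distribution
        gate = torch.sigmoid(self.gate(h))                                # sparsity gate in [0, 1]
        theta = gate * theta
        theta = theta / theta.sum(-1, keepdim=True).clamp_min(1e-8)       # renormalize after gating
        beta = F.softmax(self.topic_word_logits(), dim=-1)                # topic-word matrix (K, V)
        recon = theta @ beta                                              # document-word probabilities
        nll = -(bow * recon.clamp_min(1e-10).log()).sum(-1).mean()        # reconstruction loss
        kld = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return nll + kld

if __name__ == "__main__":
    model = ECNTMSketch(vocab_size=2000)
    bow = torch.randint(0, 3, (8, 2000)).float()       # toy BOW counts
    sbert = torch.randn(8, 384)                        # toy sentence embeddings
    loss = model(bow, sbert)
    loss.backward()
    print(float(loss))

In this sketch the sparsity constraint is realized as a learned multiplicative gate on the document-topic vector; the paper's actual topic controller and Gaussian-mixture variant of the decoder may differ in form.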
Keywords: neural topic model; short text; sparsity constraint; variational autoencoder; topic modeling
Classification Number: TP391 [Automation and Computer Technology / Computer Application Technology]