Authors: LIU Gang[1,2]; WANG Tongli[1,2]; TANG Hongwei; ZHAN Kai; YANG Wenli (College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China; Modeling and Emulation in E-Government National Engineering Laboratory, Harbin Engineering University, Harbin 150001, China; PwC Enterprise Digital, PricewaterhouseCoopers, Sydney 2070, Australia)
Affiliations: [1] College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China; [2] Modeling and Emulation in E-Government National Engineering Laboratory, Harbin Engineering University, Harbin 150001, China; [3] PwC Enterprise Digital, PricewaterhouseCoopers, Sydney 2070, Australia
Source: Computer Engineering and Applications, 2024, No. 1, pp. 154-164 (11 pages)
Funding: Higher Education Teaching Reform Research Project of Heilongjiang Province (SJGZ20200044); Natural Science Foundation of Heilongjiang Province (LH2021F015); National High-End Foreign Expert Recruitment Program (G2021180008L)
Abstract: Most current topic models are built on word co-occurrence information within the texts themselves and do not introduce topic sparsity constraints to improve topic extraction. In addition, short texts suffer from sparse word co-occurrence, which seriously degrades the accuracy of short-text topic modeling. To address these problems, an enhanced context neural topic model (ECNTM) is proposed. ECNTM imposes a sparsity constraint on topics through a topic controller, filtering out irrelevant topics, and the model input becomes the concatenation of a BOW vector and an SBERT sentence embedding. In the Gaussian decoder, the topic distribution over words is treated as a multivariate Gaussian distribution or a Gaussian mixture distribution in the embedding space, which explicitly enriches the limited context information of short texts and alleviates the sparsity of word co-occurrence features. Experimental results on four public datasets (WS, Reuters, KOS, and 20 NewsGroups) show that the model clearly outperforms the baseline models in perplexity, topic coherence, and text classification accuracy, demonstrating the effectiveness of introducing topic sparsity constraints and rich contextual information into short-text topic modeling.
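As an illustration of the architecture described in the abstract, the following is a minimal PyTorch sketch (not the authors' released implementation) of a VAE-style neural topic model: the encoder takes the concatenation of a BOW vector and an SBERT sentence embedding, a sigmoid gate stands in for the topic controller that sparsifies the document-topic distribution, and the decoder models each topic as a diagonal Gaussian over word embeddings. All layer names, dimensions, and hyperparameters below are assumptions made for illustration only.

# Illustrative sketch only; names, sizes, and the exact form of the topic controller are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECNTMSketch(nn.Module):
    def __init__(self, vocab_size, sbert_dim=384, n_topics=50, embed_dim=100, hidden=256):
        super().__init__()
        # Encoder input: concatenation of the BOW vector and the SBERT sentence embedding.
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size + sbert_dim, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
        )
        self.fc_mu = nn.Linear(hidden, n_topics)
        self.fc_logvar = nn.Linear(hidden, n_topics)
        # Assumed stand-in for the "topic controller": a per-document gate over topics.
        self.gate = nn.Linear(hidden, n_topics)
        # Gaussian decoder: each topic is a diagonal Gaussian in word-embedding space.
        self.word_emb = nn.Parameter(torch.randn(vocab_size, embed_dim) * 0.1)
        self.topic_mu = nn.Parameter(torch.randn(n_topics, embed_dim) * 0.1)
        self.topic_logvar = nn.Parameter(torch.zeros(n_topics, embed_dim))

    def topic_word_logits(self):
        # log N(e_w; mu_k, sigma_k^2) evaluated for every (topic k, word w) pair -> (K, V).
        diff = self.word_emb.unsqueeze(0) - self.topic_mu.unsqueeze(1)    # (K, V, D)
        var = self.topic_logvar.exp().unsqueeze(1)                        # (K, 1, D)
        return -0.5 * ((diff ** 2) / var + self.topic_logvar.unsqueeze(1)).sum(-1)

    def forward(self, bow, sbert):
        h = self.encoder(torch.cat([bow, sbert], dim=-1))
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()              # reparameterization trick
        theta = F.softmax(z, dim=-1)                                      # document-topic distribution
        gate = torch.sigmoid(self.gate(h))                                # sparsity gate in [0, 1]
        theta = gate * theta
        theta = theta / theta.sum(-1, keepdim=True).clamp_min(1e-8)       # renormalize after gating
        beta = F.softmax(self.topic_word_logits(), dim=-1)                # topic-word matrix (K, V)
        recon = theta @ beta                                              # document-word probabilities
        nll = -(bow * recon.clamp_min(1e-10).log()).sum(-1).mean()        # reconstruction loss
        kld = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return nll + kld

if __name__ == "__main__":
    model = ECNTMSketch(vocab_size=2000)
    bow = torch.randint(0, 3, (8, 2000)).float()       # toy BOW counts
    sbert = torch.randn(8, 384)                        # toy sentence embeddings
    loss = model(bow, sbert)
    loss.backward()
    print(float(loss))

In this sketch the sparsity constraint is realized as a learned multiplicative gate on the document-topic vector; the paper's actual topic controller and Gaussian-mixture variant of the decoder may differ in form.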
Keywords: neural topic model; short text; sparsity constraint; variational autoencoder; topic modeling
Classification Number: TP391 [Automation and Computer Technology / Computer Application Technology]