Multi-source text topic mining model based on Dirichlet multinomial allocation model (cited by: 1)

Authors: XU Liyang; HUANG Ruizhang [1,2,3]; CHEN Yanping; QIAN Zhisen; LI Wanying

Affiliations: [1] College of Computer Science and Technology, Guizhou University, Guiyang, Guizhou 550025, China; [2] Guizhou Provincial Key Laboratory of Public Big Data (Guizhou University), Guiyang, Guizhou 550025, China; [3] State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing, Jiangsu 210093, China

Source: Journal of Computer Applications (《计算机应用》), 2018, No. 11, pp. 3094-3099, 3104 (7 pages)

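Funding: National Natural Science Foundation of China (61462011); Major Research Plan of the National Natural Science Foundation of China (91746116); Guizhou Province Major Applied Basic Research Project (黔科合JZ字[2014]2001); Guizhou Province Science and Technology Major Special Program (黔科合重大专项字[2017]3002); Natural Science Foundation of Guizhou Province (黔科合基础[2018]1035)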

Abstract: As the channels through which text data arrive become more varied, topic mining over multi-source text data has become a research focus in text mining. Traditional topic models are designed mainly for single-source text data, so applying them directly to multi-source data is subject to many limitations. To address this problem, a multi-source topic mining model based on the Dirichlet Multinomial Allocation (DMA) model is proposed, named the Multi-Source Dirichlet Multinomial Allocation (MSDMA) model. By accounting for the differences in a topic's word distribution across data sources, and by drawing on the nonparametric clustering property of DMA, the model addresses three problems: 1) it learns the source-specific word distribution of the same topic in each data source; 2) because the data sources share a topic space and a term space, topic knowledge can be complemented across sources, which improves topic discovery on data sources with high noise and low information content; 3) it learns the number of topics within each data source automatically, so the number of topics need not be specified in advance. Experimental results on a simulated dataset and two real datasets indicate that the proposed model extracts topic information from multi-source data more effectively and efficiently than state-of-the-art topic models.
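The three properties listed in the abstract can be illustrated with a small generative simulation. What follows is only a sketch under stated assumptions, not the authors' MSDMA model or its Gibbs sampler, whose details are not given in this record. It assumes that each document carries a single topic, that each source has its own topic weights drawn from a symmetric Dirichlet prior with concentration `alpha / max_topics`, and that the source-specific word distributions of a topic are tied by drawing them around a shared base distribution. The function names and the parameters `max_topics`, `beta`, and `tie_strength` are hypothetical.

```python
import random


def sample_dirichlet(rng, concentration):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    # Clamp tiny concentrations so gammavariate always gets a positive shape.
    draws = [rng.gammavariate(max(c, 1e-3), 1.0) for c in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_sources=2, max_topics=20, vocab_size=30,
                    docs_per_source=50, doc_length=15,
                    alpha=1.0, beta=0.5, tie_strength=50.0, seed=0):
    """Simulate multi-source documents over a shared topic and term space."""
    rng = random.Random(seed)
    # Shared topic space: one base word distribution per topic index.
    # (Assumption made for this sketch; the paper may tie sources differently.)
    base = [sample_dirichlet(rng, [beta] * vocab_size)
            for _ in range(max_topics)]
    corpus = []
    for _ in range(num_sources):
        # DMA-style prior: a sparse symmetric Dirichlet over max_topics
        # candidates, so most candidate topics receive almost no weight.
        weights = sample_dirichlet(rng, [alpha / max_topics] * max_topics)
        # Source-specific word distribution of each topic, centred on the base.
        phi = [sample_dirichlet(rng, [tie_strength * p for p in base[k]])
               for k in range(max_topics)]
        docs = []
        for _ in range(docs_per_source):
            # One topic per document (mixture-model assumption).
            z = rng.choices(range(max_topics), weights=weights)[0]
            words = rng.choices(range(vocab_size), weights=phi[z], k=doc_length)
            docs.append((z, words))
        corpus.append({"weights": weights, "phi": phi, "docs": docs})
    return corpus


if __name__ == "__main__":
    for s, source in enumerate(generate_corpus()):
        used = {z for z, _ in source["docs"]}
        print(f"source {s}: {len(used)} of 20 candidate topics are non-empty")
```

Because the prior concentration is spread over many candidate topics, each simulated source typically uses only a few of them, which gives a rough intuition for how the number of topics per source can be left to the data rather than fixed in advance. Inference, that is, recovering the topic assignments from the words, is not shown here.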

Keywords: multi-source text data; topic model; Gibbs sampling; Dirichlet multinomial allocation model; text mining

Classification: TP301.6 [Automation and Computer Technology: Computer System Architecture]
