Microblog bursty topic detection based on topic tree (基于主题树的微博突发话题检测)

Cited by: 6

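Affiliations: [1] School of Software, Liaoning Technical University, Huludao, Liaoning 125100; [2] Institute of Systems Engineering, Liaoning Technical University, Huludao, Liaoning 125100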

Source: Journal of Computer Applications (《计算机应用》), 2014, Issue 8, pp. 2332-2335 (4 pages)

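Funding: National Natural Science Foundation of China (70971059); Liaoning Province Innovation Team Project (2009T045); Liaoning Province Growth Plan for Outstanding Young Scholars in Higher Education Institutions (LJQ2012027)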

Abstract: Traditional topic detection methods cope poorly with microblog text, which is full of nonstandard and casual wording, ambiguous references, and internet slang. To address this, a topic tree detection method based on the Latent Dirichlet Allocation (LDA) model is proposed. First, related microblogs are organized into a topic tree using an information-entropy-increasing method from Natural Language Processing (NLP). Together with a design in which the Dirichlet prior α and the empirical value β vary dynamically with the number of topics, and with the model's distinctive dual probability statistics, this yields a "contribution" score for every word in a text, so that interfering information is handled in advance and the influence of garbage data on topic detection is excluded. Then, the contribution score is used as the parameter value of an improved Vector Space Model (VSM) to compute the similarity between documents and extract bursty topics, with the aim of improving detection precision. The method was evaluated from two angles: F-value comparison and manual inspection. The experimental data show that the algorithm not only detects bursty topics but also improves on the HowNet model and the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm by about 3% and 7%, respectively, and that its results agree better with human judgment.

Keywords: Latent Dirichlet Allocation; topic tree; semantic similarity; Vector Space Model; topic detection

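Classification: TP391 [Automation and Computer Technology: Computer Application Technology]; TP18 [Automation and Computer Technology: Computer Science and Technology]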
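
The similarity step described in the abstract, where a per-word "contribution" score replaces the usual term weight in a Vector Space Model, might look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' implementation: the `contribution` dictionary, the function names, and the choice of cosine similarity are all hypothetical, and the abstract does not say how the contribution scores are computed from the LDA model.

```python
import math
from collections import Counter


def weighted_vector(tokens, contribution):
    """Build a sparse document vector: term count times a per-word weight.

    `contribution` is a hypothetical dict mapping word -> weight (assumed to
    come from an upstream topic model). Words with a missing or zero weight
    are dropped, which stands in for filtering out noisy terms.
    """
    counts = Counter(tokens)
    return {
        word: count * contribution[word]
        for word, count in counts.items()
        if contribution.get(word, 0.0) > 0.0
    }


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v[word] for word, x in u.items() if word in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Made-up weights, for illustration only.
    contribution = {"earthquake": 0.9, "rescue": 0.7, "lol": 0.0}
    doc_a = weighted_vector(["earthquake", "rescue", "lol"], contribution)
    doc_b = weighted_vector(["earthquake", "lol", "lol"], contribution)
    print(cosine_similarity(doc_a, doc_b))
```

In this sketch a zero-weight word such as "lol" never enters the vector, so it cannot affect the similarity between two posts, however often it appears.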
