Micro-blog hot topic detection method based on improved K-means


Authors: CHEN Yangjian [1]; WEN Qiuhua [2]

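Affiliations: [1] Digital Service Center, Guangzhou Open University (Guangzhou Radio and Television University), Guangzhou, Guangdong 510000, China; [2] School of Information Science and Technology, Jinan University, Guangzhou, Guangdong 510000, China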

Source: Journal of Terahertz Science and Electronic Information Technology, 2023, No. 3, pp. 378-383, 391 (7 pages)

Funding: Ninth Batch of Education and Teaching Reform Fund Projects for Universities in Guangzhou, Guangdong Province (2017F10).

Abstract: Micro-blog text is high-dimensional and rich in synonymy and polysemy. Traditional hot topic detection methods that combine the Vector Space Model (VSM) with K-means suffer from low accuracy, high computational complexity, and difficulty in determining the cluster centers. This paper proposes a micro-blog text vectorization method in which a Relevance Vector Machine (RVM) optimizes the VSM. First, the adaptive feature selection ability of the RVM is used to reduce the dimension of the VSM feature vectors. Principal Component Analysis (PCA) is then applied to determine the initial cluster centers of the K-means algorithm, and K-means is run to obtain the clustering results. Finally, a heat index is defined from the numbers of micro-blog reposts, comments, and high-influence users, and the topic with the largest heat index is taken as the current hot topic. Experiments on a real micro-blog text dataset show that, compared with two traditional methods, the proposed method improves accuracy by 7.3% and 1.1% and real-time performance by 45% and 53%, respectively.

Keywords: hot topic detection; vector space model; topic clustering; data dimensionality reduction; micro-blog

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
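
The clustering and ranking stages described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The RVM dimension-reduction step is omitted, so the input vectors stand in for already-reduced VSM features. The PCA-based initialization shown here (sort points along the first principal axis, split into k equal groups, use group means) is one common variant and is assumed, since the abstract does not specify the exact procedure. The heat index is assumed to be a weighted sum, and `HEAT_WEIGHTS`, the field names, and all function names are hypothetical.

```python
import math

# Hypothetical weights; the abstract does not give the heat index formula.
HEAT_WEIGHTS = {"reposts": 1.0, "comments": 1.0, "influencers": 1.0}

def first_principal_axis(points, iters=200):
    """Approximate the first principal component by power iteration."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    v = [float(j + 1) for j in range(d)]  # arbitrary non-zero start vector
    for _ in range(iters):
        # w = X^T (X v): multiply v by the unnormalised covariance matrix
        proj = [sum(r[j] * v[j] for j in range(d)) for r in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate data: keep the current direction
            break
        v = [x / norm for x in w]
    return mean, v

def pca_initial_centers(points, k):
    """Assumed variant: order points along the first principal axis, cut the
    ordering into k equal-sized groups, return each group's mean (needs n >= k)."""
    mean, v = first_principal_axis(points)
    n, d = len(points), len(mean)
    order = sorted(range(n), key=lambda i: sum(
        (points[i][j] - mean[j]) * v[j] for j in range(d)))
    centers = []
    for g in range(k):
        idx = order[g * n // k:(g + 1) * n // k]
        centers.append([sum(points[i][j] for i in idx) / len(idx)
                        for j in range(d)])
    return centers

def kmeans(points, init_centers, max_iter=100):
    """Standard Lloyd iterations starting from the given centers."""
    centers = [list(c) for c in init_centers]
    d, labels = len(points[0]), None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda c: sum(
            (p[j] - centers[c][j]) ** 2 for j in range(d))) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's center unchanged
                centers[c] = [sum(m[j] for m in members) / len(members)
                              for j in range(d)]
    return labels, centers

def hottest_topic(vectors, posts, k):
    """Cluster the posts, sum a weighted heat score per cluster, pick the max."""
    labels, _ = kmeans(vectors, pca_initial_centers(vectors, k))
    heat = {}
    for lab, post in zip(labels, posts):
        score = sum(w * post[name] for name, w in HEAT_WEIGHTS.items())
        heat[lab] = heat.get(lab, 0.0) + score
    return max(heat, key=heat.get), labels, heat

# Toy data: two well-separated groups of (already reduced) feature vectors.
vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
posts = ([{"reposts": 2, "comments": 1, "influencers": 0}] * 3
         + [{"reposts": 50, "comments": 30, "influencers": 4}] * 3)
topic, labels, heat = hottest_topic(vectors, posts, k=2)
print("cluster labels:", labels)
print("hottest cluster:", topic, "heat:", heat[topic])
```

Starting K-means from deterministic, data-driven centers rather than random ones is what removes the run-to-run variation that the abstract attributes to the difficulty of choosing cluster centers. In a real setting the weights in `HEAT_WEIGHTS` would need to be chosen or fitted for the data at hand.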

 
