A Weibo Topic Discovery Algorithm Based on CWMD and SP


Authors: SUN Yue; LUO Qian [1]; FANG Liangyu (School of Information and Communication Engineering, Beijing Information Science & Technology University, Beijing 100192, China)

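Affiliation: [1] School of Information and Communication Engineering, Beijing Information Science & Technology University, Beijing 100192, China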

Source: Journal of Beijing Information Science and Technology University (Natural Science Edition), 2021, No. 2, pp. 76-81 (6 pages)

Funding: China Academy of Railway Sciences, Locomotive Running Gear Condition Monitoring System (9151524108).

Abstract: Traditional Weibo topic discovery algorithms compute text distance by considering only the minimum sum of word-to-word distances. To address the problems this causes, the paper proposes an algorithm that uses CWMD (cos-word mover's distance) as the clustering criterion. The algorithm combines cosine distance with WMD to compute the similarity between sentences; replaces the word-frequency weight vector in WMD with a TF-IDF vector, so that the contribution of every word to the document is taken into account; and uses CWMD in place of a traditional distance as the criterion for SP (Single-Pass) clustering. It also proposes an SP algorithm that builds a pending pool of texts, intended to keep the order in which data arrives from affecting the result and thereby to improve the accuracy of topic discovery. Comparative experiments on a subset of a Chinese corpus database confirm that the proposed topic discovery model performs better in the comparison. The model is then applied to Weibo text crawled with Python; comparing the keywords of the extracted clusters with Weibo's trending topics shows a strong correlation between the two.
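The pending-pool idea can be illustrated with a minimal Python sketch. The abstract does not give the CWMD formula or the pool rules, so everything specific here is an assumption: the two thresholds `join_below` and `new_above`, the rule that pooled texts are assigned to their nearest cluster once all other texts have been seen, the mean-vector centroid, and the use of plain cosine distance as a stand-in for CWMD (any distance function with the same signature could be passed in instead).

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Cosine distance between two sparse term-weight vectors (dicts).
    Stand-in for CWMD, whose exact formula the abstract does not give."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def centroid(vectors, members):
    """Mean of the sparse vectors whose indices are listed in members."""
    total = Counter()
    for i in members:
        for term, weight in vectors[i].items():
            total[term] += weight
    return {term: weight / len(members) for term, weight in total.items()}


def nearest(vec, vectors, clusters, distance):
    """Return (cluster index, distance) of the closest centroid, or (None, inf)."""
    best_i, best_d = None, float("inf")
    for i, members in enumerate(clusters):
        d = distance(vec, centroid(vectors, members))
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d


def single_pass_with_pool(vectors, join_below, new_above, distance=cosine_distance):
    """Single-Pass clustering with a pending pool (hypothetical rules).

    d <= join_below : join the nearest cluster
    d >= new_above  : start a new cluster
    otherwise       : defer the text to the pending pool
    Returns clusters as lists of indices into vectors.
    """
    clusters, pool = [], []
    for i, vec in enumerate(vectors):
        idx, d = nearest(vec, vectors, clusters, distance)
        if idx is None or d >= new_above:
            clusters.append([i])
        elif d <= join_below:
            clusters[idx].append(i)
        else:
            pool.append(i)  # ambiguous now; decide after all texts are seen
    for i in pool:  # assumed rule: pooled texts go to their nearest cluster
        idx, _ = nearest(vectors[i], vectors, clusters, distance)
        clusters[idx].append(i)
    return clusters


# Toy term-weight vectors; text 1 arrives before its topic has a cluster.
docs = [
    {"rail": 1, "train": 1},
    {"train": 1, "film": 1, "actor": 1},
    {"film": 1, "actor": 1},
    {"film": 1, "actor": 1, "award": 1},
    {"rail": 1, "train": 1, "delay": 1},
]
clusters = single_pass_with_pool(docs, join_below=0.4, new_above=0.9)
print(clusters)  # [[0, 4], [2, 3, 1]]
```

In this toy run, text 1 is held back because the only cluster available when it arrives is the rail one; once the film cluster exists, the deferred text lands there. A plain Single-Pass run with one threshold would have had to decide immediately, which is the order sensitivity the abstract says the pending pool is meant to reduce.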

Keywords: word vector weighting; cosine distance; word mover's distance; incremental clustering; topic discovery

Classification code: TP391.9 [Automation and Computer Technology: Computer Application Technology]

 
