Online Topic Detection Method Based on Combination Similarity Dynamic Clustering and Word Entropy (基于组合相似度动态聚类和词熵的网络话题在线检测)


Authors: Guo Hui [1], Wang Ya'nan [2], Wang Xinyan, Wei Yize, Wang Yangting [1]

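Affiliations: [1] North China Institute of Science and Technology, Langfang 065201; [2] School of Management and Economics, Hebei University of Science and Technology, Shijiazhuang 050018; [3] Ministry of Emergency Management Big Data Center, Beijing 100013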

Source: Journal of Intelligence (《情报杂志》), 2024, No. 5, pp. 159-166 (8 pages)

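Funding: Research outcome of the National Social Science Fund of China project "Research on the Community Health Margin and the Construction of Protection Systems During Major Epidemics" (No. 21BSH072).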

Abstract: [Research purpose] To detect hot topics on the Internet online and to improve the clustering quality of incremental clustering algorithms, this paper proposes a dynamic clustering algorithm based on combined similarity, and extracts topic words and tracks their evolution by computing word entropy.

[Research method] Named entities are recognized with a CIFG-BiLSTM-CRF model, and the entity similarity between a text and a topic is calculated. The maximum cosine similarity between the text's word vectors and the topic center is then taken as the word-vector similarity, and the two similarities are combined to decide which topic the text belongs to. During clustering, a time-window strategy dynamically updates the topic center and the member texts. In parallel, the word entropy of each text is calculated to build a word-entropy sum list for each topic, which supports topic word extraction and evolution tracking. The experiment performs online topic detection on COVID-19 news data and presents the evolution and tracking of the topic words.

[Research conclusion] The experiments show that, compared with traditional similarity calculation methods, combined similarity yields better clustering performance, and the topic words extracted during clustering correctly reflect the hot-topic content of the original data.
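The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. The abstract gives no formulas, so every concrete choice here is an assumption made for illustration and not the authors' method: entity similarity is taken as Jaccard overlap, the two similarities are mixed linearly with a hypothetical weight `alpha`, the topic center is the mean of the word vectors of the texts still inside the time window, and a word's entropy is approximated by its `-p * log2(p)` term within a text. The names `Topic`, `assign`, `threshold`, and `window` are likewise hypothetical, and named entities and word vectors are assumed to be supplied by upstream models.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def entity_similarity(text_entities, topic_entities):
    """Assumption: Jaccard overlap between the two entity sets."""
    a, b = set(text_entities), set(topic_entities)
    return len(a & b) / len(a | b) if a | b else 0.0


def vector_similarity(word_vectors, center):
    """Maximum cosine similarity between the text's word vectors and the center."""
    return max((cosine(v, center) for v in word_vectors), default=0.0)


def word_entropy(tokens):
    """Assumption: per-word term -p*log2(p), with p the word's relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: -(c / total) * math.log2(c / total) for w, c in counts.items()}


@dataclass
class Topic:
    members: list = field(default_factory=list)  # (timestamp, doc) pairs

    def add(self, t, doc, window):
        # Time-window update: keep only texts no older than `window` time units.
        self.members.append((t, doc))
        self.members = [(ts, d) for ts, d in self.members if t - ts <= window]

    @property
    def entities(self):
        return {e for _, d in self.members for e in d["entities"]}

    @property
    def center(self):
        # Assumption: center is the mean of all word vectors of current members.
        vecs = [v for _, d in self.members for v in d["vectors"]]
        return [sum(col) / len(vecs) for col in zip(*vecs)]

    def topic_words(self, k=3):
        # Sum per-text word entropies over members and keep the k largest sums.
        totals = Counter()
        for _, d in self.members:
            totals.update(word_entropy(d["tokens"]))
        return [w for w, _ in totals.most_common(k)]


def assign(doc, t, topics, alpha=0.5, threshold=0.5, window=3):
    """Add doc to the most similar topic, or open a new topic below threshold."""
    best, best_score = None, threshold
    for topic in topics:
        score = (alpha * entity_similarity(doc["entities"], topic.entities)
                 + (1 - alpha) * vector_similarity(doc["vectors"], topic.center))
        if score >= best_score:
            best, best_score = topic, score
    if best is None:
        best = Topic()
        topics.append(best)
    best.add(t, doc, window)
    return best
```

In this sketch each incoming text is a dictionary with `entities`, `vectors`, and `tokens` keys; calling `assign(doc, t, topics)` for texts in arrival order grows the `topics` list incrementally, and `topic.topic_words()` can be called after each step to observe how a topic's top words shift as older texts leave the window.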

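Keywords: network topics; online topic detection; incremental clustering; topic word extraction; combined similarity; dynamic clustering algorithm; word entropy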

Classification code: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
