Research on Hot Topic Detection Based on an Incremental Text Clustering Algorithm (基于增量文本聚类算法的热点话题检测研究)

Authors: WEI Yize (魏艺泽); GUO Hui (郭慧); SHI Xiaoxu (时晓旭) (School of Computer Science, North China Institute of Science and Technology, Yanjiao 065201, China)

Affiliation: [1] School of Computer Science, North China Institute of Science and Technology, Yanjiao (east of Beijing) 065201

Source: Journal of North China Institute of Science and Technology (《华北科技学院学报》), 2024, No. 1, pp. 76-81, 124 (7 pages in total)

Funding: Science and Technology Innovation 2030 Major Project (2021ZD0114203); National Social Science Fund of China (21BSH072).

Abstract: Traditional TF-IDF cannot be updated incrementally when extracting text features, and the traditional Single-Pass algorithm has low clustering accuracy. To address these two problems, this paper builds an IDF table from an existing corpus and keeps updating it, which reduces the dependence of the TF-IDF calculation on the corpus, and it computes each cluster center as a mean, which improves the clustering accuracy of the Single-Pass algorithm. The model is validated on COVID-19 news data collected from several major platforms. The results show that the method allows traditional TF-IDF keyword extraction to be updated incrementally, and that the improved Single-Pass algorithm raises the comprehensive evaluation index by 8.64%. Compared with the traditional Single-Pass algorithm, the improved algorithm only needs to compare against a subset of candidate clusters, which effectively reduces the number of comparisons and improves both the accuracy and the efficiency of clustering.
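The two ideas named in the abstract, an IDF table that is seeded from an existing corpus and then updated, and Single-Pass clustering with mean cluster centers, can be illustrated with the minimal Python sketch below. It is not the authors' implementation. The class and function names, the smoothed IDF formula, cosine similarity as the similarity measure, and the threshold value are all assumptions made for illustration. The abstract does not say how candidate clusters are selected, so this sketch simply compares each document against every cluster.

```python
import math
from collections import Counter


class IncrementalIdf:
    """IDF table seeded from an existing corpus and updated per document (hypothetical name)."""

    def __init__(self, seed_docs):
        self.n_docs = 0
        self.doc_freq = Counter()
        for tokens in seed_docs:
            self.update(tokens)

    def update(self, tokens):
        # Count each term once per document, then grow the corpus size.
        self.n_docs += 1
        self.doc_freq.update(set(tokens))

    def idf(self, term):
        # Assumed smoothed IDF, so that unseen terms still get a finite weight.
        return math.log((1 + self.n_docs) / (1 + self.doc_freq[term])) + 1.0

    def tfidf(self, tokens):
        counts = Counter(tokens)
        total = len(tokens)
        return {t: (c / total) * self.idf(t) for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class Cluster:
    """Cluster whose center is the running mean of its member vectors."""

    def __init__(self, vec):
        self.size = 1
        self.center = dict(vec)

    def add(self, vec):
        # Running mean: center <- (center * size + vec) / (size + 1)
        n = self.size
        terms = set(self.center) | set(vec)
        self.center = {
            t: (self.center.get(t, 0.0) * n + vec.get(t, 0.0)) / (n + 1)
            for t in terms
        }
        self.size = n + 1


def single_pass(docs, idf_table, threshold=0.3):
    """Assign each tokenized document to its most similar cluster, or open a new one."""
    clusters, labels = [], []
    for tokens in docs:
        # Update the IDF table first; earlier centers keep their older weights.
        idf_table.update(tokens)
        vec = idf_table.tfidf(tokens)
        best, best_sim = None, 0.0
        for i, cluster in enumerate(clusters):
            sim = cosine(vec, cluster.center)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= threshold:
            clusters[best].add(vec)
            labels.append(best)
        else:
            clusters.append(Cluster(vec))
            labels.append(len(clusters) - 1)
    return labels, clusters
```

To try it, build `IncrementalIdf` from a few tokenized seed documents and pass a list of tokenized news items to `single_pass`, which returns one cluster label per document together with the clusters. A candidate-filtering step, such as only comparing against clusters that share a keyword with the new document, would be one possible way to reduce the number of comparisons, but that choice is a guess and is not taken from the paper.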

Keywords: Single-Pass; text clustering; text similarity; hot topic detection; TF-IDF

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]
