Research on Hot Topic Detection Based on an Incremental Text Clustering Algorithm


Authors: WEI Yize; GUO Hui; SHI Xiaoxu (School of Computer Science, North China Institute of Science and Technology, Yanjiao 065201, China)

Affiliation: [1] School of Computer Science, North China Institute of Science and Technology, Yanjiao (east of Beijing) 065201

Source: Journal of North China Institute of Science and Technology (《华北科技学院学报》), 2024, No. 1, pp. 76-81, 124 (7 pages)

Funding: Science and Technology Innovation 2030 Major Project (2021ZD0114203); National Social Science Fund of China (21BSH072).

Abstract: Traditional TF-IDF feature extraction cannot be updated incrementally, and the traditional Single-Pass algorithm has low clustering accuracy. To address these problems, this paper builds an IDF table from an existing corpus and updates it as new text arrives, which reduces the dependence of the TF-IDF calculation on the corpus. It also computes each cluster center as the mean of the cluster's members, which improves the clustering accuracy of the Single-Pass algorithm. The model is validated on COVID-19 news data collected from major platforms. The results show that the method allows traditional TF-IDF keyword extraction to be updated incrementally, and that the improved Single-Pass algorithm raises the comprehensive evaluation index by 8.64%. Compared with the traditional Single-Pass algorithm, the improved algorithm only needs to compare each text with a subset of candidate clusters, which effectively reduces the number of comparisons and improves both the accuracy and the efficiency of clustering.
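The two ideas in the abstract, an IDF table that is seeded from an existing corpus and then updated, and a Single-Pass clusterer whose cluster centers are member means, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The class names `IncrementalIdf` and `SinglePassClusterer`, the `threshold` parameter, the smoothed IDF formula, and the use of cosine similarity are all assumptions made for illustration. The abstract does not say how candidate clusters are selected, so this sketch simply compares each text against every existing cluster.

```python
import math
from collections import Counter


class IncrementalIdf:
    """Hypothetical IDF table: seeded from a corpus, then updated per document."""

    def __init__(self):
        self.doc_count = 0          # number of documents seen so far
        self.doc_freq = Counter()   # term -> number of documents containing it

    def update(self, tokens):
        # Each document contributes at most 1 to a term's document frequency.
        self.doc_count += 1
        self.doc_freq.update(set(tokens))

    def idf(self, term):
        # Smoothed IDF (an assumption): unseen terms still get a finite weight.
        return math.log((1 + self.doc_count) / (1 + self.doc_freq[term])) + 1.0

    def vectorize(self, tokens):
        # Sparse TF-IDF vector as a dict: term -> weight.
        counts = Counter(tokens)
        total = sum(counts.values())
        return {t: (c / total) * self.idf(t) for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SinglePassClusterer:
    """Single-Pass clustering where each center is the mean of its members."""

    def __init__(self, threshold):
        self.threshold = threshold  # minimum similarity to join a cluster
        self.centroids = []         # one mean vector per cluster
        self.sizes = []             # number of members per cluster

    def add(self, vec):
        # Find the most similar existing cluster (all clusters are scanned here).
        best, best_sim = -1, 0.0
        for i, center in enumerate(self.centroids):
            sim = cosine(vec, center)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= self.threshold:
            # Join the cluster and update its center as a running mean.
            n = self.sizes[best]
            center = self.centroids[best]
            for t in set(center) | set(vec):
                center[t] = (center.get(t, 0.0) * n + vec.get(t, 0.0)) / (n + 1)
            self.sizes[best] = n + 1
            return best
        # Otherwise the text starts a new cluster.
        self.centroids.append(dict(vec))
        self.sizes.append(1)
        return len(self.centroids) - 1
```

In use, one would call `update` on each document of the existing corpus to seed the table, then for every incoming text call `update`, `vectorize`, and `add` in turn. A suitable `threshold` depends on the data and would have to be tuned.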

Keywords: Single-Pass; text clustering; text similarity; hot topic detection; TF-IDF

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]

 
