Efficient Clustering Algorithm of Short Text Streams Based on Episodic Memory (基于情节记忆的高效短文本流聚类算法)

Authors: LIU Zijian; WANG Yong; LIU Yuanni; ZHOU Yousheng [1,3]

Affiliations: [1] College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; [2] Datang Microelectronics Technology Co., Ltd., Beijing 100094, China; [3] College of Cyberspace Security and Information Law, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

Source: Computer Engineering (《计算机工程》), 2023, No. 10, pp. 145-153 (9 pages)

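Funding: National Natural Science Foundation of China (62272076); Natural Science Foundation of Chongqing, General Program (cstc2020jcyj-msxmX0343, cstc2020jcyj-msxmX1021); Science and Technology Research Program of the Chongqing Municipal Education Commission (KJZD-K20200602).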

Abstract: Most existing similarity-based clustering algorithms for short text streams require the similarity threshold to be set manually and struggle with text sparsity. To address the characteristics of short text streams and the limitations of traditional stream clustering algorithms, a clustering algorithm for short text streams based on episodic memory is proposed. The idea of episodic memory is integrated into stream clustering: sparse experience replay enhances the feature representation of clusters, and an inverted index improves clustering efficiency. In the online phase, a new similarity formula and a dynamically computed similarity threshold are used to assign the current text to an existing cluster or to a new cluster, and the cluster features are updated continuously. In the offline phase, overall performance is improved through cluster enhancement, semantic reassignment, and deletion of outdated clusters. Experiments on public and synthetic datasets show that, compared with the baseline stream clustering algorithms, the proposed algorithm achieves better results on all evaluation metrics, and that on datasets with a large number of texts its running time is reduced by 1 to 3 orders of magnitude.

Keywords: text stream clustering; short text stream; episodic memory; topic evolution; text features

Classification: TP3 [Automation and Computer Technology: Computer Science and Technology]
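
The abstract describes an online phase (candidate lookup through an index, similarity-based assignment with a dynamically computed threshold, continuous feature updates) and an offline step that deletes outdated clusters, but it gives neither the similarity formula nor the threshold rule. The Python sketch below only illustrates that general workflow under stated assumptions: cosine similarity over term counts and a "mean minus standard deviation of past best similarities" threshold are stand-ins chosen for illustration, not the paper's formulas. The class name `ToyStreamClusterer` and the parameter `max_idle` are hypothetical, and sparse experience replay and semantic reassignment are not modeled.

```python
import math
import statistics
from collections import Counter, defaultdict


class ToyStreamClusterer:
    """Illustrative online clusterer for short texts (not the paper's algorithm)."""

    def __init__(self, max_idle=1000):
        self.clusters = {}              # cluster id -> Counter of term frequencies
        self.last_seen = {}             # cluster id -> clock value of last assignment
        self.index = defaultdict(set)   # inverted index: term -> ids of clusters containing it
        self.history = []               # best similarity observed for each past text
        self.max_idle = max_idle        # assumed staleness limit, measured in texts
        self.clock = 0
        self.next_id = 0

    @staticmethod
    def _cosine(doc, cluster):
        # Assumed similarity: cosine over raw term counts.
        dot = sum(count * cluster[term] for term, count in doc.items())
        norm = math.sqrt(sum(c * c for c in doc.values())) * math.sqrt(
            sum(c * c for c in cluster.values())
        )
        return dot / norm if norm else 0.0

    def _threshold(self):
        # Assumed dynamic rule: mean minus population std of past best similarities.
        if not self.history:
            return 0.0
        return max(0.0, statistics.mean(self.history) - statistics.pstdev(self.history))

    def add(self, text):
        """Assign one text to an existing or a new cluster and return the cluster id."""
        self.clock += 1
        doc = Counter(text.lower().split())

        # Only clusters sharing at least one term with the text are compared.
        candidates = set()
        for term in doc:
            candidates |= self.index.get(term, set())

        best_id, best_sim = None, 0.0
        for cid in sorted(candidates):
            sim = self._cosine(doc, self.clusters[cid])
            if sim > best_sim:
                best_id, best_sim = cid, sim

        threshold = self._threshold()   # computed before this text enters the history
        if best_id is not None:
            self.history.append(best_sim)

        if best_id is None or best_sim < threshold:
            best_id = self.next_id      # open a new cluster
            self.next_id += 1
            self.clusters[best_id] = Counter()

        # Update the cluster features and the inverted index.
        self.clusters[best_id].update(doc)
        self.last_seen[best_id] = self.clock
        for term in doc:
            self.index[term].add(best_id)
        return best_id

    def prune(self):
        """Offline step: delete clusters idle for more than max_idle texts."""
        stale = [cid for cid, seen in self.last_seen.items()
                 if self.clock - seen > self.max_idle]
        for cid in stale:
            for term in self.clusters.pop(cid):
                self.index[term].discard(cid)
            del self.last_seen[cid]
        return stale
```

Feeding the texts "apple iphone launch", "new iphone launch event", "football match tonight", and "football match score" to `add` in that order returns the labels 0, 0, 1, 1. With `max_idle=1`, a subsequent `prune()` call removes cluster 0, since it received no text during the last two steps.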
