Authors: LIU Zijian (刘子健); WANG Yong (王勇); LIU Yuanni (刘媛妮); ZHOU Yousheng (周由胜)[1,3]
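Affiliations: [1] College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; [2] Datang Microelectronics Technology Co., Ltd., Beijing 100094, China; [3] College of Cyberspace Security and Information Law, Chongqing University of Posts and Telecommunications, Chongqing 400065, China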
Source: Computer Engineering (《计算机工程》), 2023, No. 10, pp. 145-153 (9 pages)
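Funding: National Natural Science Foundation of China (62272076); General Program of the Natural Science Foundation of Chongqing (cstc2020jcyj-msxmX0343, cstc2020jcyj-msxmX1021); Science and Technology Research Program of the Chongqing Municipal Education Commission (KJZD-K20200602).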
Abstract: Most existing similarity-based clustering algorithms for short text streams require the similarity threshold to be set manually, and they handle text sparsity poorly. To address the characteristics of short text streams and the limitations of traditional stream clustering algorithms, a short text stream clustering algorithm based on episodic memory is proposed. The idea of episodic memory is integrated into the stream clustering algorithm: sparse experience replay enhances the feature representation of the clusters, and an inverted index improves clustering efficiency. In the online stage, a new similarity formula and a dynamically computed similarity threshold are used to assign the current text to an existing cluster or to a new cluster, and the cluster features are updated continuously. In the offline stage, cluster enhancement, semantic reassignment, and the deletion of outdated clusters improve the overall performance of the algorithm. Experiments on public and synthetic datasets show that the proposed algorithm achieves better results than the baseline stream clustering algorithms on all evaluation metrics, and that on datasets with a large number of texts the running time can be reduced by 1 to 3 orders of magnitude.
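The online stage described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not give the similarity formula or the threshold rule, so this sketch assumes cosine similarity over term counts and a threshold equal to the mean minus one standard deviation of previously accepted similarities. The class name `OnlineShortTextClusterer`, the parameter `fallback_threshold`, and the sample texts are all hypothetical. Episodic memory, sparse experience replay, and the offline stage are omitted.

```python
from collections import Counter, defaultdict
import math


class OnlineShortTextClusterer:
    """Toy online clusterer: inverted index plus a running-statistics threshold."""

    def __init__(self, fallback_threshold=0.3):
        self.clusters = []                  # cluster id -> Counter of term frequencies
        self.index = defaultdict(set)       # term -> ids of clusters containing it
        self.fallback_threshold = fallback_threshold  # used until statistics exist
        self.accepted_sims = []             # similarities of past successful assignments

    @staticmethod
    def _cosine(doc, cluster):
        # Assumed similarity: cosine over raw term counts (not the paper's formula).
        dot = sum(freq * cluster[term] for term, freq in doc.items())
        norm_doc = math.sqrt(sum(f * f for f in doc.values()))
        norm_cluster = math.sqrt(sum(f * f for f in cluster.values()))
        return dot / (norm_doc * norm_cluster) if norm_doc and norm_cluster else 0.0

    def _threshold(self):
        # Assumed rule: mean minus one standard deviation of accepted similarities.
        if len(self.accepted_sims) < 2:
            return self.fallback_threshold
        n = len(self.accepted_sims)
        mean = sum(self.accepted_sims) / n
        var = sum((s - mean) ** 2 for s in self.accepted_sims) / n
        return max(0.0, mean - math.sqrt(var))

    def add(self, text):
        """Assign one text to an existing or new cluster and return the cluster id."""
        doc = Counter(text.lower().split())

        # The inverted index limits comparison to clusters sharing at least one term.
        candidates = set()
        for term in doc:
            candidates |= self.index.get(term, set())

        best_id, best_sim = None, 0.0
        for cid in candidates:
            sim = self._cosine(doc, self.clusters[cid])
            if sim > best_sim:
                best_id, best_sim = cid, sim

        if best_id is not None and best_sim >= self._threshold():
            self.accepted_sims.append(best_sim)
        else:
            best_id = len(self.clusters)    # open a new cluster
            self.clusters.append(Counter())

        # Update the cluster features and the index with the new text.
        self.clusters[best_id].update(doc)
        for term in doc:
            self.index[term].add(best_id)
        return best_id


stream = [
    "stock market falls sharply",
    "stock market rises sharply",
    "football team wins final",
    "football final tonight",
]
clusterer = OnlineShortTextClusterer()
labels = [clusterer.add(text) for text in stream]
print(labels)  # [0, 0, 1, 1]
```

Because only clusters that share a term with the incoming text are scored, the cost per text depends on the number of candidate clusters rather than on the total number of clusters, which is the general motivation for using an inverted index here.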
Keywords: text stream clustering; short text stream; episodic memory; topic evolution; text features
Classification number: TP3 [Automation and Computer Technology: Computer Science and Technology]