Short-Text Clustering Algorithm Based on Laplacian Graph  (Cited by: 2)

Authors: MENG Hai-ning [1,2]; FENG Kai; ZHU Lei; ZHANG Bei-bei; TONG Xin-yu [1]; HEI Xin-hong

Affiliations: [1] School of Computer Science and Engineering, Xi'an University of Technology, Xi'an, Shaanxi 710048, China; [2] Shaanxi Key Laboratory for Network Computing and Security Technology, Xi'an, Shaanxi 710048, China

Source: Acta Electronica Sinica (《电子学报》), 2021, No. 9, pp. 1716-1723 (8 pages)

Funding: National Natural Science Foundation of China (No. 61602375, No. 61773313).

Abstract: A Laplacian graph clustering algorithm based on word-frequency processing is proposed to address the high dimensionality and feature sparsity of short-text data. First, the term frequency-inverse document frequency (TF-IDF) method is used to map the short-text dataset into the text vector space, yielding a word-frequency weight matrix. Second, the dimension of this weight matrix is reduced by exploiting the graph clustering property of the Laplacian matrix. Third, because the eigenvalues of the Laplacian matrix reflect the degree of text similarity, the eigenvectors corresponding to the first K eigenvalues are selected as the initial cluster centers, which reduces the number of iterations in the clustering process. Experiments on the SSC, 20 News Group, and Microblog PCU datasets show that, compared with traditional clustering algorithms, the Laplacian graph clustering algorithm produces better clustering results, converges faster, is less affected by noise points, and is robust.
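The pipeline outlined in the abstract (TF-IDF weighting, Laplacian-based dimension reduction, then clustering in the reduced space) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the whitespace tokenizer, cosine-similarity graph, unnormalized Laplacian L = D - W, and farthest-first k-means initialization are all assumptions. In particular, the abstract does not give the exact rule for deriving initial cluster centers from the eigenvectors, so a generic deterministic initialization is swapped in here.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(docs):
    """Return an (n_docs, n_terms) TF-IDF matrix with L2-normalized rows."""
    tokenized = [d.lower().split() for d in docs]  # assumption: whitespace tokens
    vocab = sorted({t for doc in tokenized for t in doc})
    index = {t: j for j, t in enumerate(vocab)}
    n = len(docs)
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    X = np.zeros((n, len(vocab)))
    for i, doc in enumerate(tokenized):
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log((1 + n) / (1 + df[term])) + 1.0  # smoothed IDF
            X[i, index[term]] = tf * idf
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)


def spectral_embedding(X, k):
    """Embed documents with the eigenvectors of the k smallest eigenvalues of L."""
    W = X @ X.T                    # cosine similarity, since rows are unit length
    np.fill_diagonal(W, 0.0)       # no self-loops in the similarity graph
    D = np.diag(W.sum(axis=1))     # degree matrix
    L = D - W                      # unnormalized graph Laplacian (assumption)
    _, eigvecs = np.linalg.eigh(L)  # eigh returns eigenvalues in ascending order
    return eigvecs[:, :k]


def kmeans(points, k, n_iter=100):
    """Lloyd's k-means with farthest-first initialization (assumed stand-in)."""
    centers = [points[0]]
    while len(centers) < k:
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(dists))])
    centers = np.array(centers)
    for _ in range(n_iter):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def cluster_short_texts(docs, k):
    """TF-IDF -> Laplacian embedding -> k-means on the reduced representation."""
    X = tfidf_matrix(docs)
    Y = spectral_embedding(X, k)
    return kmeans(Y, k)
```

Calling `cluster_short_texts(docs, k)` on a list of short strings returns one integer label per document. The embedding has only k columns regardless of vocabulary size, which is where the dimension reduction described in the abstract comes from. Chinese microblog text would need a proper word segmenter in place of the whitespace split assumed here.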

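Keywords: Laplacian graph; term frequency-inverse document frequency (TF-IDF); short-text clustering; vector space model; data dimension reduction; feature weight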

Classification code: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]

 
