STK: A Clustering Method Based on Contrastive Learning Embeddings


Authors: LIU Jinxia [1], ZHANG Xi

Affiliation: [1] School of Economics and Management, Taiyuan University of Science and Technology, Taiyuan 030024, China

Source: Computer Science (《计算机科学》), 2024, No. S02, pp. 621-626 (6 pages)

Funding: Teaching Reform Project (JG2023092).

Abstract: SimCSE, a contrastive learning method, performs well in text embedding and clustering. This paper aims to optimize the sentence embeddings generated by SimCSE-trained models so that they suit clustering tasks. By combining several algorithms and tuning training parameters, it addresses the choice of clustering algorithm and the influence of noise and outliers. The paper proposes STK (SimCSE t-SNE KMeans), an unsupervised clustering model that combines KL divergence with the KMeans algorithm. SimCSE first encodes the text. The t-SNE algorithm then reduces the dimensionality of the high-dimensional embeddings: by minimizing the KL divergence, it preserves in the low-dimensional space the similarity relationships between the high-dimensional data points, so the reduction also improves the text embedding representation. Finally, KMeans clusters the reduced embeddings to produce the clustering result. Compared with results obtained using BERT, UMAP, HDBSCAN, and other algorithms, the proposed model achieves better clustering on datasets of patents and papers from the hydrogen production field, especially as measured by the silhouette coefficient.
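
The three stages named in the abstract (encode, reduce with t-SNE, cluster with KMeans, then score with the silhouette coefficient) can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the SimCSE encoding step is replaced by synthetic 768-dimensional vectors from `make_blobs`, and the function name `stk_cluster`, the 2-dimensional target space, the perplexity, the cluster count, and the choice to compute the silhouette coefficient on the reduced embeddings are all hypothetical choices that the abstract does not specify.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score


def stk_cluster(embeddings, n_clusters, perplexity=30.0, seed=0):
    """Reduce embeddings with t-SNE, cluster with KMeans, score the result.

    Hypothetical helper: the name and all defaults are assumptions.
    """
    # t-SNE fits the low-dimensional layout by minimizing the KL divergence
    # between pairwise similarity distributions in the two spaces.
    reduced = TSNE(
        n_components=2,  # assumed target dimensionality
        perplexity=perplexity,
        init="pca",
        random_state=seed,
    ).fit_transform(embeddings)

    # KMeans assigns each reduced point to one of n_clusters centroids.
    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(reduced)

    # Silhouette coefficient in [-1, 1]; higher means tighter, better
    # separated clusters. Computed here on the reduced space (assumption).
    score = silhouette_score(reduced, labels)
    return reduced, labels, score


# Stand-in for SimCSE sentence embeddings (assumption): 150 synthetic
# vectors of dimension 768 drawn around 3 centers.
embeddings, _ = make_blobs(
    n_samples=150, n_features=768, centers=3, random_state=0
)
reduced, labels, score = stk_cluster(embeddings, n_clusters=3)
print(reduced.shape, labels.shape, float(score))
```

In practice the `embeddings` array would come from a SimCSE model applied to the patent and paper texts, and the number of clusters would have to be chosen for the corpus. Neither detail is given in the abstract.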

Keywords: SimCSE; sentence embedding; KL divergence; clustering; silhouette coefficient

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]

 
