Authors: LIU Jinxia (刘晋霞)[1], ZHANG Xi (张曦)
Affiliation: [1] School of Economics and Management, Taiyuan University of Science and Technology, Taiyuan 030024, China
Source: Computer Science (《计算机科学》), 2024, Issue S02, pp. 621-626 (6 pages)
Funding: Teaching reform project (JG2023092).
Abstract: SimCSE, a contrastive learning method, has shown good performance in text embedding and clustering. This paper aims to optimize the sentence embeddings produced by SimCSE-trained models so that they are suitable for clustering tasks. By combining several algorithms and adjusting training parameters, it addresses problems such as the choice of clustering algorithm and the influence of noise and outliers. The paper proposes STK (SimCSE t-SNE KMeans), an unsupervised clustering model that combines KL divergence with the KMeans algorithm. SimCSE first encodes the text. The t-SNE algorithm then reduces the dimensionality of the high-dimensional embeddings: by minimizing the KL divergence, it preserves in the low-dimensional space the similarity relationships between the high-dimensional data points, so that dimensionality reduction also improves the text embedding representation. Finally, the KMeans algorithm clusters the reduced embeddings to produce the clustering result. Compared with results obtained using BERT, UMAP, HDBSCAN and other algorithms, the proposed model shows better clustering performance on datasets of patents and papers in the hydrogen production field, especially on the silhouette coefficient.
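The last two stages described in the abstract (t-SNE reduction, then KMeans, scored by the silhouette coefficient) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: it uses scikit-learn, replaces the SimCSE encoder with synthetic vectors so that it runs without downloading a model, and the function name stk_cluster, the 2-dimensional target space, the perplexity, and the choice to score the silhouette on the reduced embeddings are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score


def stk_cluster(embeddings, n_clusters, perplexity=30.0, seed=0):
    """Reduce embeddings with t-SNE, then cluster the reduced points with KMeans.

    Hypothetical helper: the name and all default values are assumptions.
    """
    # t-SNE fits the low-dimensional layout by minimizing a KL divergence
    # between pairwise similarities in the original and the reduced space.
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    reduced = tsne.fit_transform(embeddings)

    # KMeans runs on the reduced embeddings, as the abstract describes.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(reduced)

    # Assumption: the silhouette coefficient is computed in the reduced space.
    score = silhouette_score(reduced, labels)
    return reduced, labels, score, tsne.kl_divergence_


# Stand-in for SimCSE sentence embeddings (assumption): three noisy groups
# of 40 vectors each in a 64-dimensional space.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 64))
embeddings = np.vstack([c + rng.normal(size=(40, 64)) for c in centers])

reduced, labels, score, kl = stk_cluster(embeddings, n_clusters=3)
print(f"silhouette={score:.3f}, final KL divergence={kl:.3f}")
```

With real data, the synthetic array would be replaced by the sentence vectors from a SimCSE encoder, and n_clusters would need to be chosen for the corpus. A silhouette value computed on t-SNE output is not directly comparable with one computed on the original embeddings.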
Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]