Authors: ZHU Liangqi (朱良奇), HUANG Bo (黄勃)[1], HUANG Jitao (黄季涛), MA Liyuan (马莉媛), SHI Zhicai (史志才)[1,2]
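Affiliations: [1] School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China; [2] Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai 200240, China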
Source: Computer Engineering and Applications (《计算机工程与应用》), 2022, No. 2, pp. 145-152 (8 pages)
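Funding: Open Project of the Shanghai Key Laboratory of Integrated Administration Technologies for Information Security (AGK2019004); Songjiang District Science and Technology Research Project (19SJKJGG83).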
Abstract: Short texts contain fewer words than long texts, which makes their semantic features harder to extract. Representing them with the traditional vector space model (VSM) tends to produce high-dimensional, sparse vectors. Such sparse word representations lack semantic relatedness and create a semantic gap, so downstream clustering suffers from low accuracy and sensitivity to noise. To address this, a new clustering model, BERT_AE_K-Means, is proposed. It uses the pre-trained model BERT (bidirectional encoder representations from transformers) to initialize the text representation, self-trains an AutoEncoder on the text representation vectors to extract high-order features, and then jointly trains the resulting feature extractor (the Encoder) with the K-Means clustering model. The feature extraction module and the clustering module are thus optimized at the same time, which improves the accuracy and robustness of the clustering model. On four datasets, the proposed model improves both accuracy and normalized mutual information (NMI) over six models including Word2Vec_K-Means and STC2, and reaches an accuracy of 82.28% on the SearchSnippet dataset. The experimental results show that the proposed method effectively improves the accuracy of short text clustering.
Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]
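
The abstract reports results in terms of clustering accuracy and normalized mutual information. The following is a minimal sketch of how these two metrics are commonly computed, not the authors' evaluation code. The function names are hypothetical, the accuracy uses a brute-force search over cluster-to-label mappings (practical only for a handful of clusters), and the NMI is normalized by the geometric mean of the two entropies, which is an assumption since the record does not state the normalization used.

```python
import math
from collections import Counter
from itertools import permutations


def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of cluster ids to labels.

    Brute force over permutations (illustrative only); the Hungarian
    algorithm is the usual choice when there are many clusters.
    """
    labels = list(set(y_true))
    clusters = list(set(y_pred))
    # Pad with None so that surplus clusters can stay unmatched.
    padded = labels + [None] * max(0, len(clusters) - len(labels))
    overlap = Counter(zip(y_pred, y_true))
    best = max(
        sum(overlap[(c, l)] for c, l in zip(clusters, perm))
        for perm in permutations(padded, len(clusters))
    )
    return best / len(y_true)


def entropy(assignments):
    """Shannon entropy (natural log) of a list of labels or cluster ids."""
    n = len(assignments)
    return -sum((c / n) * math.log(c / n) for c in Counter(assignments).values())


def nmi(y_true, y_pred):
    """Mutual information divided by sqrt(H(true) * H(pred)) (assumed form)."""
    n = len(y_true)
    count_t, count_p = Counter(y_true), Counter(y_pred)
    joint = Counter(zip(y_true, y_pred))
    mi = sum(
        (nij / n) * math.log(nij * n / (count_t[t] * count_p[p]))
        for (t, p), nij in joint.items()
    )
    denom = math.sqrt(entropy(y_true) * entropy(y_pred))
    if denom == 0.0:
        # Degenerate case: at least one side puts everything in one group.
        return 1.0 if entropy(y_true) == entropy(y_pred) else 0.0
    return mi / denom


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    predicted = [1, 1, 0, 0, 2, 2]  # same grouping, different cluster ids
    print(clustering_accuracy(truth, predicted), nmi(truth, predicted))
```

Both metrics ignore how cluster ids are named, so a clustering that recovers the true groups under different ids still scores as a perfect match.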