Authors: Li Xiaolu; Zhao Qingcong [1,3]; Qi Lin (School of Information Management, Beijing Information Science and Technology University, Beijing 100192; School of Economics and Management, Beijing Information Science and Technology University, Beijing 100192; Beijing Key Laboratory of Big Data Decision-making for Green Development, Beijing 100192; Beijing World Urban Circular Economy System (Industry) Collaborative Innovation Center, Beijing 100192)
Affiliations: [1] School of Information Management, Beijing Information Science and Technology University, Beijing 100192; [2] School of Economics and Management, Beijing Information Science and Technology University, Beijing 100192; [3] Beijing Key Laboratory of Big Data Decision-making for Green Development, Beijing 100192; [4] Beijing World Urban Circular Economy System (Industry) Collaborative Innovation Center, Beijing 100192
Source: Modern Computer (《现代计算机》), 2022, No. 2, pp. 37-43 (7 pages)
Funding: National Key R&D Program of China (2017YFB1400400).
Abstract: Traditional short-text clustering suffers from sparse feature keywords, high feature dimensionality, and a disregard for text semantics. Based on a short-text entry dataset extracted from the ancient texts Complete Book Collection in Four Sections (《四库全书》) and Imperial Readings of the Taiping Era (《太平御览》), a fusion model combining BERT (Bidirectional Encoder Representations from Transformers), K-means, and iterative training is proposed for clustering the short-text dataset. A pre-trained BERT model produces vector representations of the short-text entries; these vectors are fed to the K-means algorithm to obtain an initial clustering; an outlier-detection algorithm then splits the clustering result into outlier and non-outlier sets; a classifier trained on the non-outliers re-assigns the outliers; and the process iterates until a stopping criterion is reached. Comparative experiments pit the BERT word-vector model against TF-IDF (Term Frequency-Inverse Document Frequency) and Word2vec word-vector models; the results show that the pre-trained BERT representations yield a significant improvement over both, and the experiments also confirm the effectiveness of iterative training on this ancient-text short-text dataset.
Keywords: ancient texts; short-text clustering; BERT model; K-means clustering; iterative training
Classification: TP391.1 [Automation & Computer Technology — Computer Application Technology]
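The clustering pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: random vectors stand in for the BERT sentence embeddings, a simple distance-to-centroid rule stands in for the paper's unspecified outlier detector, logistic regression stands in for the classifier trained on non-outliers, and a fixed iteration cap is an assumed stopping criterion.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholder embeddings standing in for BERT vectors of short-text entries
X = rng.normal(size=(200, 32))

k = 4
# Step 1: initial clustering of the embeddings with K-means
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

for _ in range(5):  # iterate; a fixed round cap serves as the stopping criterion here
    # Step 2 (assumed outlier rule): flag points far from their cluster centroid
    centroids = np.stack([
        X[labels == c].mean(axis=0) if np.any(labels == c) else np.zeros(X.shape[1])
        for c in range(k)
    ])
    dist = np.linalg.norm(X - centroids[labels], axis=1)
    inlier = dist <= dist.mean() + 2 * dist.std()
    if inlier.all():
        break  # no outliers left to re-assign
    # Step 3: train a classifier on the non-outlier set only
    clf = LogisticRegression(max_iter=1000).fit(X[inlier], labels[inlier])
    # Step 4: re-assign the outliers with the trained classifier
    new_labels = labels.copy()
    new_labels[~inlier] = clf.predict(X[~inlier])
    if np.array_equal(new_labels, labels):
        break  # assignments stable: another stopping criterion
    labels = new_labels

print("final cluster sizes:", np.bincount(labels, minlength=k))
```

The 2-sigma distance threshold and the choice of classifier are illustrative; the paper does not name a specific outlier detector or classifier in the abstract.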