Research on Clustering Method of Ancient Chinese Short Texts Using Iterative Training


Authors: Li Xiaolu; Zhao Qingcong [1,3]; Qi Lin

Affiliations: [1] School of Information Management, Beijing Information Science and Technology University, Beijing 100192; [2] School of Economics and Management, Beijing Information Science and Technology University, Beijing 100192; [3] Beijing Key Laboratory of Big Data Decision-making for Green Development, Beijing 100192; [4] Beijing World Urban Circular Economy System (Industry) Collaborative Innovation Center, Beijing 100192

Source: Modern Computer (《现代计算机》), 2022, No. 2, pp. 37-43 (7 pages)

Funding: National Key R&D Program of China (2017YFB1400400).

Abstract: Traditional short text clustering suffers from sparse feature keywords, high feature dimensionality, and a disregard for text semantics. Based on a data set of short text entries extracted from the ancient Chinese works Complete Book Collection in Four Sections (《四库全书》) and Imperial Readings of the Taiping Era (《太平御览》), this paper proposes a fusion model combining BERT (Bidirectional Encoder Representations from Transformers), K-means, and iterative training to cluster the short text data set. The pre-trained BERT model is used to obtain vector representations of the short text entries, and these vectors are fed into the K-means algorithm to produce an initial clustering. An outlier detection algorithm then splits the clustering result into an outlier set and a non-outlier set; a classifier trained on the non-outliers re-assigns the outliers, and the process is iterated until a stopping criterion is reached. Comparative experiments against the TF-IDF (Term Frequency-Inverse Document Frequency) and Word2vec word vector models show that the pre-trained BERT representation yields a significant improvement over both, and the experiments also demonstrate the effectiveness of iterative training on the ancient Chinese short text data set used in this paper (a minimal code sketch of the pipeline is given after this record).

Keywords: ancient Chinese texts; short text clustering; BERT model; K-means clustering; iterative training

CLC Number: TP391.1 [Automation and Computer Technology - Computer Application Technology]
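
The sketch referenced in the abstract follows. It is a minimal, illustrative rendering of the described pipeline in Python, assuming the Hugging Face transformers and scikit-learn libraries. The specific outlier rule (distance-to-centroid percentile), the classifier (logistic regression), and all names and parameters below (`bert_embed`, `iterative_cluster`, `entries`, `outlier_pct`, etc.) are hypothetical choices for illustration, not details taken from the paper.

```python
# Illustrative sketch of the abstract's BERT + K-means + iterative-training
# pipeline. The outlier rule and classifier are assumed, not from the paper.
import numpy as np
import torch
from transformers import BertTokenizer, BertModel
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression


def bert_embed(texts, model_name="bert-base-chinese", batch_size=32):
    """Encode each short text entry as its [CLS] vector from a pre-trained BERT."""
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name).eval()
    vecs = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True,
                              truncation=True, max_length=64, return_tensors="pt")
            out = model(**batch)
            vecs.append(out.last_hidden_state[:, 0, :].numpy())  # [CLS] embeddings
    return np.vstack(vecs)


def iterative_cluster(X, n_clusters, max_iter=10, outlier_pct=90):
    """K-means initialisation, then repeated outlier re-assignment with a classifier."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    labels = km.labels_.copy()
    centers = km.cluster_centers_
    for _ in range(max_iter):
        # Distance of every point to the centroid of its assigned cluster.
        dist = np.linalg.norm(X - centers[labels], axis=1)
        # Assumed outlier rule: points beyond a distance percentile.
        outlier = dist > np.percentile(dist, outlier_pct)
        if not outlier.any():
            break
        train_labels = labels[~outlier]
        if np.unique(train_labels).size < 2:  # not enough classes left to train on
            break
        # Train a classifier on the non-outliers, then re-assign the outliers.
        clf = LogisticRegression(max_iter=1000).fit(X[~outlier], train_labels)
        new_labels = labels.copy()
        new_labels[outlier] = clf.predict(X[outlier])
        if np.array_equal(new_labels, labels):  # stopping criterion: no change
            break
        labels = new_labels
        # Recompute centroids from the updated assignment (keep old one if empty).
        centers = np.vstack([
            X[labels == c].mean(axis=0) if (labels == c).any() else centers[c]
            for c in range(n_clusters)])
    return labels


# Hypothetical usage on a few made-up entries (not from the paper's data set).
entries = ["天文志", "地理志", "礼乐志", "刑法志", "食货志", "艺文志"]
X = bert_embed(entries)
print(iterative_cluster(X, n_clusters=2))
```

For the comparison experiments mentioned in the abstract, TF-IDF or Word2vec vectors would replace the output of `bert_embed` as the clustering input; the sketch above keeps only the BERT branch.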

 
