Chinese Short Text Classification Algorithm Based on Stacking-Bert Ensemble Learning

Cited by: 10

Authors: ZHENG Cheng-yu; WANG Xin [1]; WANG Ting; YIN Tian-tian; DENG Ya-ping (School of Mathematics and Computer Science, Yunnan Minzu University, Kunming 650500, China)

Affiliation: [1] School of Mathematics and Computer Science, Yunnan Minzu University, Kunming 650500, China

Source: Science Technology and Engineering (《科学技术与工程》), 2022, No. 10, pp. 4033-4038 (6 pages)

Funding: National Natural Science Foundation of China (61363022); Scientific Research Fund of the Education Department of Yunnan Province (2021Y670).

摘  要:由于word2vec、Glove等静态词向量表示方法存在无法完整表示文本语义等问题,且当前主流神经网络模型在做文本分类问题时,其预测效果往往依赖于具体问题,场景适应性差,泛化能力弱。针对上述问题,提出一种多基模型框架(Stacking-Bert)的中文短文本分类方法。模型采用BERT预训练语言模型进行文本字向量表示,输出文本的深度特征信息向量,并利用TextCNN、DPCNN、TextRNN、TextRCNN等神经网络模型构建异质多基分类器,通过Stacking集成学习获取文本向量的不同特征信息表达,以提高模型的泛化能力,最后利用支持向量机(support vector machine,SVM)作为元分类器模型进行训练和预测。与word2vec-CNN、word2vec-BiLSTM、BERT-TexCNN、BERT-DPCNN、BERT-RNN、BERT-RCNN等文本分类算法在网络公开的三个中文数据集上进行对比实验,结果表明,Stacking-Bert集成学习模型的准确率、精确率、召回率和F_(1)均为最高,能有效提升中文短文本的分类性能。Duo to the static word vector representation methods such as word2vec and Glove have problems such as incomplete representation of text semantics,and when the current mainstream neural network model is doing text classification problems,its prediction effect often depends on specific problems,the scene adaptability is poor,and the generalization ability is weak.To solve the above problems,a chinese short text classification method based on multi-base model framework named Stacking-Bert was proposed.Firstly,the model used the BERT pre-trained language model to represent text word vectors,and the deep feature information vector of the text is output.Then,the neural network models such as TextCNN,DPCNN,TextRNN,TextRCNN is used to construct a heterogeneous multi-base classifier,and obtain the text vector through Stacking integration learning Different feature information was expressed to improve the generalization ability of the model.Finally,the support vector machine was used as a meta-classifier model for training and prediction.Comparing experiments with text classification algorithms such as word2vec-CNN,word2vec-BiLSTM,BERT-texCNN,BERT-DPCNN,BERT-RNN,BERT-RCNN,etc.on three Chinese data sets published on the Internet,the results show that Stacking-Bert integrated learning.The model has the highest accuracy rate,precision rate,recall rate and F_(1) value,which can effectively improve the classification performance of chinese short texts.

Keywords: multi-base model framework; BERT pre-trained language model; Stacking ensemble learning; short text classification

Classification code: TP391.1 [Automation and Computer Technology - Computer Application Technology]
