A Small-Corpus Chinese Text Classification Method Based on Optimized Pre-trained Models (Cited by: 1)


Authors: Chen Lan, Yang Fan [1], Zeng Zhen [1] (School of Information, Guizhou University of Finance and Economics, Guiyang 550000)

Affiliation: [1] School of Information, Guizhou University of Finance and Economics, Guiyang 550000

Source: Modern Computer, 2022, No. 16, pp. 1-8, 15 (9 pages)

Funding: Ministry of Education Industry-University Cooperative Education Project (BZX1902-20): Integrated Experimental Teaching Design for User Information Behavior Analysis Based on Jupyter Notebook.

摘  要:针对GloVe、BERT模型生成的字向量在小语料库中表义不足的问题,提出融合向量预训练模型,对小语料中文短文本分类的精确度进行提升。本文以今日头条新闻公开数据集为实验对象,使用GloVe、BERT模型通过领域预训练,对GloVe与BERT生成的预训练字向量进行向量融合,实现语义增强,从而提升短文本分类效果。结果表明,当语料库中的数据量为500时,融合字向量的准确度相较于BERT字向量的准确度提升了5个百分点,相较于GloVe字向量的准确度提升了3个百分点。词义选取的维度待进一步加强。本文所提方法能够对小语料库的短文本数据实现精准分类,对后续文本挖掘工作具有重要意义。Aiming at the problem of insufficient representation of word vectors generated by GloVe and BERT models in small corpora,a fusion vector pre-training model was proposed to improve the accuracy of Chinese short text classification in small corpora.Taking today’s headline public data set as the experimental object,using GloVe and BERT models through domain pretraining,vector fusion of pre-trained word vectors generated by GloVe and BERT to achieve semantic enhancement,thereby improving the short text classification effect.When the amount of data in the corpus is 500,the accuracy of the fused word vector is improved by 5 percentage points compared to the accuracy of the BERT word vector,and the accuracy of the GloVe word vector is improved by 3 percentage points.The dimension of word meaning selection needs to be further strengthened.The proposed method can accurately classify short text data in small corpus,which is of great significance for subsequent text mining work.

Keywords: BERT; GloVe; vector fusion; small corpus; short text

CLC Number: TP391.1 [Automation and Computer Technology: Computer Application Technology]

 
