Research on Long Text Classification Based on the Combination of BERT Feature Representation and Attention Mechanism  (Cited by: 1)


Author: Chen Jie [1] (School of Data Science and Information Technology, China Women's University, Beijing 100101, China)

Affiliation: [1] School of Data Science and Information Technology, China Women's University, Beijing 100101, China

Source: Computer Era, 2023, No. 5, pp. 136-139, 144 (5 pages)

Funding: China Women's University Scientific Research Fund (ZKY200020228).

Abstract: Pre-trained language models have strong feature representation ability but cannot be applied to long text directly. A hierarchical feature extraction method is proposed for this purpose. Within the maximum sequence length allowed by BERT, the text is segmented into blocks along natural sentence boundaries; a self-attention mechanism is applied to obtain enhanced features for the first and last blocks; the PCA algorithm is then used to compress the resulting feature vector and retain its principal components. Under 5-fold cross-validation, the mean classification accuracy and weighted F1-score reach 95.29% and 95.28% on the THUCNews dataset, and 89.68% and 89.69% on the Sogou dataset, respectively. The proposed method extracts the text features most relevant to the topic and improves long text classification performance, while PCA compression of the feature vector reduces the complexity of the classification model and improves time efficiency.
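The abstract outlines the full pipeline, so a compact sketch may help make the steps concrete. The following is a minimal, hypothetical reconstruction, not the authors' code: it assumes the Hugging Face transformers library with the bert-base-chinese checkpoint, scaled dot-product self-attention over per-block [CLS] vectors, and scikit-learn's PCA with a logistic-regression classifier; the block budget, PCA dimension, and classifier choice are illustrative, as the abstract does not specify them.

```python
# Sketch of the described pipeline: sentence-boundary block splitting,
# per-block BERT encoding, self-attention enhancement of block features,
# PCA compression, and 5-fold cross-validation. Illustrative only.
import re
import numpy as np
import torch
from transformers import BertModel, BertTokenizer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

MAX_LEN = 512  # BERT's maximum sequence length

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

def split_into_blocks(text, max_len=MAX_LEN - 2):  # reserve [CLS]/[SEP]
    """Greedily pack whole sentences into blocks that fit within BERT's limit."""
    sentences = [s for s in re.split(r"(?<=[。！？])", text) if s.strip()]
    blocks, current = [], ""
    for sent in sentences:
        if current and len(tokenizer.tokenize(current + sent)) > max_len:
            blocks.append(current)
            current = sent
        else:
            current += sent
    if current:
        blocks.append(current)
    return blocks

@torch.no_grad()
def encode_long_text(text):
    """[CLS] vector per block -> self-attention -> enhanced first/last blocks."""
    blocks = split_into_blocks(text) or [text]
    cls_vecs = []
    for block in blocks:
        enc = tokenizer(block, truncation=True, max_length=MAX_LEN,
                        return_tensors="pt")
        cls_vecs.append(bert(**enc).last_hidden_state[:, 0])   # (1, 768)
    H = torch.cat(cls_vecs, dim=0)                             # (n_blocks, 768)
    # Scaled dot-product self-attention across the block vectors.
    scores = H @ H.T / (H.shape[1] ** 0.5)
    H_enh = torch.softmax(scores, dim=-1) @ H                  # context-enhanced
    # Concatenate the enhanced first and last blocks, per the abstract.
    return torch.cat([H_enh[0], H_enh[-1]]).numpy()            # (1536,)

def evaluate(texts, labels, n_components=128):
    """PCA-compress the enhanced features and run 5-fold cross-validation."""
    X = np.stack([encode_long_text(t) for t in texts])
    pipe = make_pipeline(PCA(n_components=n_components),
                         LogisticRegression(max_iter=1000))
    scores = cross_validate(pipe, X, np.asarray(labels), cv=5,
                            scoring=("accuracy", "f1_weighted"))
    return scores["test_accuracy"].mean(), scores["test_f1_weighted"].mean()
```

Wrapping PCA and the classifier in a scikit-learn pipeline ensures the projection is refitted on each training fold during cross-validation, so the compressed features do not leak information across folds.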

Keywords: text classification; pre-trained language model; attention mechanism; feature vector; PCA

CLC number: TP391.1 [Automation and Computer Technology / Computer Application Technology]

 
