Authors: MA Jun; LV Lu-cheng; ZHAO Ya-juan [2]; LI Cong-ying
Affiliations: [1] Information Research Center of Military Sciences, Academy of Military Sciences, Beijing 100142, China; [2] National Science Library, Chinese Academy of Sciences, Beijing 100190, China
Source: Chinese Journal of Medical Library and Information Science, 2022, No. 11, pp. 20-28 (9 pages)
Abstract: Objective To support the accurate automatic classification of large-scale Chinese patents, this paper explored pre-trained language models with improved Chinese patent text representation. Methods Starting from the Chinese pre-trained language model RoBERTa, transfer learning was performed on a large-scale Chinese invention patent corpus using the masked language model task with a single-character masking strategy and a whole-word masking strategy respectively, yielding two models with improved Chinese patent text representation: ZL-RoBERTa and ZL-RoBERTa-wwm. The models were then applied to a patent text classification task and compared with a typical deep learning model (Word2Vec+BiGRU+ATT+TextCNN) and the current state-of-the-art pre-trained language models BERT and RoBERTa. Results The ZL-RoBERTa-based and ZL-RoBERTa-wwm-based Chinese patent classification models achieved higher precision, recall, and F1 scores on the patent text classification task. Conclusion Chinese patent pre-trained language models with improved text representation are more effective for patent text classification, providing a model basis for the subsequent application of pre-trained models in patent intelligence work.
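The core distinction in the Methods above is between the two masking strategies used for the masked language model task: single-character masking decides per character, so a multi-character Chinese word can be only partially masked, while whole-word masking decides per word and masks all of its characters together. The following is a minimal, self-contained Python sketch of that difference; the pre-segmented example phrase, the masking probability, and the function names are illustrative and not taken from the paper's implementation.

```python
import random

MASK = "[MASK]"

def char_masking(words, mask_prob=0.15, seed=0):
    """Single-character masking: each character is masked independently,
    so a multi-character word may end up only partially masked."""
    rng = random.Random(seed)
    chars = [c for w in words for c in w]
    return [MASK if rng.random() < mask_prob else c for c in chars]

def whole_word_masking(words, mask_prob=0.15, seed=0):
    """Whole-word masking: the masking decision is made once per word,
    and every character of a selected word is replaced by [MASK]."""
    rng = random.Random(seed)
    out = []
    for w in words:
        if rng.random() < mask_prob:
            out.extend(MASK for _ in w)
        else:
            out.extend(w)
    return out

# A pre-segmented patent-style phrase: "patent / text / automatic / classification"
words = ["专利", "文本", "自动", "分类"]
print(char_masking(words, mask_prob=0.3, seed=1))       # may split a word
print(whole_word_masking(words, mask_prob=0.3, seed=1)) # masks whole words
```

In practice this per-word decision is what lets ZL-RoBERTa-wwm learn representations aligned with Chinese word boundaries, whereas the per-character variant can leave a model predicting one half of a word from the other half.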