A Study on the Automatic Classification of Chinese Literature in Periodicals Based on BERT Model  Cited by: 11


Authors: Shen Lili[1]; Jiang Peng; Wang Jing[1] (Shanghai Library)

Affiliation: [1] Shanghai Library (Institute of Scientific and Technical Information of Shanghai), Shanghai 200031

Source: Library Journal (《图书馆杂志》), 2022, No. 5, pp. 109-118, 135 (11 pages)

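Funding: One of the research outputs of the Shanghai Library Youth Sailing Program (青年扬帆计划) special project "Research and Application of Intelligent Classification and Indexing of Digital Literature Resources Based on Deep Learning".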

Abstract: The BERT model released by the Google AI team has achieved results on a number of natural language processing tasks, but its use for the automatic classification of Chinese literature remains to be explored. This paper examines how well BERT's Chinese base model classifies Chinese social science and sci-tech periodical literature in practice, identifies the problems that arise in practical application, and proposes solutions. A total of 1,745,000 records from class R (medicine and health), class TG (metallurgy and metalworking), class F (economics), and class J (art) are used as the training corpus, with another 9,610 records as test samples, and the BERT model is applied separately to social science and sci-tech periodical literature. The test results show that the four-level accuracy of the BERT model is 76.95% for social science literature and 68.55% for sci-tech literature. A penalty strategy is then introduced to provide a reference for setting the threshold for inspection-exempt data in practical work. The BERT model is feasible to a certain extent for the actual classification and indexing work of the 《全国报刊索引》 (Quan Guo Bao Kan Suo Yin, CNBKSY) and basically meets the needs of automatic classification of Chinese literature in the current network environment.
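The abstract mentions a threshold for inspection-exempt data but does not say how it is evaluated, and the penalty strategy itself is not described here. The following is only an illustrative sketch, not the authors' method. It assumes each prediction carries a confidence score and a human-assigned class code, and reports, for a given threshold, the share of records that would skip manual review and the accuracy within that share. The `Prediction` type, the `exemption_stats` function, and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    predicted: str     # class code output by the model (made-up sample values)
    actual: str        # class code assigned by a human indexer
    confidence: float  # assumed top-class probability, in [0, 1]


def exemption_stats(preds, threshold):
    """Return (coverage, accuracy) for records at or above the threshold.

    coverage: fraction of all records that would be exempt from review.
    accuracy: fraction of exempt records whose predicted code is correct,
              or None when no record reaches the threshold.
    """
    exempt = [p for p in preds if p.confidence >= threshold]
    if not exempt:
        return 0.0, None
    correct = sum(p.predicted == p.actual for p in exempt)
    return len(exempt) / len(preds), correct / len(exempt)


# Hypothetical records, not data from the paper.
sample = [
    Prediction("R473", "R473", 0.98),
    Prediction("F832", "F832", 0.91),
    Prediction("J205", "J212", 0.88),
    Prediction("TG146", "TG146", 0.75),
    Prediction("F830", "F832", 0.40),
]

for t in (0.5, 0.9):
    coverage, accuracy = exemption_stats(sample, t)
    print(f"threshold={t:.2f} coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

Raising the threshold in this sketch trades coverage for accuracy among exempt records, which is the kind of trade-off a threshold reference would need to capture.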

Keywords: BERT model; deep learning; literature classification; Chinese Library Classification (《中国图书馆分类法》)

Classification number: G254.1 [Cultural Sciences - Library Science]

 
