Research on Automatic Chinese Library Classification Labeling for English Literature Based on Text Data Augmentation and Classification Mapping Strategies  Cited by: 3


Authors: JIANG YanTing; WU YuJie

Affiliations: [1] Chengdu Aeronautic Polytechnic, Chengdu 610100, P.R. China; [2] School of Chinese Language and Literature, Beijing Normal University, Beijing 100875, P.R. China

Source: Digital Library Forum, 2022, No. 5, pp. 39-46 (8 pages)

Abstract: Automatically labeling English literature with Chinese Library Classification (CLC) numbers can reduce the workload of library and literature-database staff and promote cross-lingual knowledge retrieval and knowledge exchange between China and other countries. To address the shortage of English literature already annotated with CLC numbers, this paper, targeting the pretrained language model BERT, proposes two text augmentation strategies, namely machine translation of Chinese literature into English and insertion of punctuation or grammatical words into the original English text to improve the classification model's generalization ability, together with a category mapping strategy from the Library of Congress Classification (LCC) to the CLC that expands the text data. Experiments show that all 3 strategies effectively improve text classification performance. With these strategies, the accuracy and macro F1 score of the classification model increase by about 6.1 and 7.4 percentage points, respectively. Finally, a small program was developed and released that automatically labels English literature, in batches, with the 20 first-level CLC class numbers.
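The abstract names two of its strategies without giving details: inserting punctuation into English text, and mapping Library of Congress Classification (LCC) classes to Chinese Library Classification (CLC) classes. The following is only a minimal sketch of what such steps might look like, not the authors' implementation. The punctuation list, the `ratio` parameter, the function names, and the handful of LCC-to-CLC pairs are all assumptions chosen for illustration; the paper's actual insertion rules and mapping table are not given here.

```python
import random

# Assumed punctuation inventory; the paper's actual list is not stated in the abstract.
PUNCTUATION = [".", ",", ";", ":", "!", "?"]


def insert_punctuation(text, ratio=0.3, seed=None):
    """Return a copy of `text` with random punctuation tokens inserted between words.

    `ratio` (hypothetical parameter) caps the number of insertions at roughly
    ratio * number_of_words; at least one mark is always inserted.
    """
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return text
    n_insert = rng.randint(1, max(1, int(ratio * len(words))))
    for _ in range(n_insert):
        pos = rng.randint(0, len(words))  # any gap, including both ends
        words.insert(pos, rng.choice(PUNCTUATION))
    return " ".join(words)


# Illustrative LCC prefix -> CLC first-level class pairs (NOT the paper's table).
LCC_TO_CLC = {
    "QA": "O",  # LCC Mathematics -> CLC mathematical sciences and chemistry
    "QC": "O",  # LCC Physics
    "QD": "O",  # LCC Chemistry
    "QH": "Q",  # LCC Natural history, biology -> CLC biological sciences
    "R": "R",   # LCC Medicine -> CLC medicine and health
    "Z": "G",   # LCC Bibliography, library science -> CLC culture, science, education
}


def map_lcc_to_clc(lcc_number):
    """Map an LCC call number to a CLC first-level class by longest prefix match.

    Returns None when no entry of the (illustrative) table applies.
    """
    letters = ""
    for ch in lcc_number.strip().upper():
        if not ch.isalpha():
            break
        letters += ch
    for end in range(len(letters), 0, -1):
        clc = LCC_TO_CLC.get(letters[:end])
        if clc is not None:
            return clc
    return None
```

Under these assumptions, a record that already carries an LCC call number could be given a CLC label with `map_lcc_to_clc`, and each labeled English text could be passed through `insert_punctuation` one or more times to produce additional training samples that keep the original label.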

Keywords: pretrained language model; Chinese Library Classification; machine translation; text augmentation; category mapping

Classification number: G250.2 [Cultural Sciences: Library Science]

 
