Study on the Discipline Classification of Massive Humanities and Social Science Academic Literature Driven by Deep Learning  Cited by: 16

Original title: 深度学习驱动的海量人文社会科学学术文献学科分类研究


Authors: Liu Jiangfeng (刘江峰), Lin Litao (林立涛), Liu Chang (刘畅)[1], He Hongxu (何洪旭), Wu Na (吴娜)[1], Shen Si (沈思), Wang Dongbo (王东波)[1]

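Affiliations: [1] College of Information Management, Nanjing Agricultural University, Nanjing, Jiangsu 210095; [2] School of Economics and Management, Nanjing University of Science and Technology, Nanjing, Jiangsu 210094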

Source: Information Studies: Theory & Application (《情报理论与实践》), 2023, Issue 2, pp. 71-81 (11 pages)

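Funding: An outcome of the National Natural Science Foundation of China project "Research on the Construction and Retrieval of Academic Full-Text Knowledge Graphs Based on Deep Learning" (Grant No. 71974094).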

Abstract:
[Purpose/significance] This study explores the differences among social science disciplines in order to support discipline construction and scientific and technological retrieval services, and to further improve the discipline classification system for literature.
[Method/process] Discipline classifiers for social science literature were built on several deep learning models and pre-trained language models, and evaluated on a dataset of nearly 3.5 million documents from more than 20 first-level disciplines in the CSSCI catalog. Sentence-BERT was used to produce sentence vectors for abstracts, which were then hierarchically clustered. Discipline groups were derived from the clustering results, and classification performance was computed for each discipline group to mitigate the effect of overlap between disciplines. A fuzzy accuracy metric, based on the top N most probable disciplines that the model outputs for each record, was used to compensate for the limitations of the original discipline classification.
[Result/conclusion] Deep pre-trained language models achieved the best performance on the "abstract + title" input. Classification over the discipline groups obtained by hierarchical clustering outperformed classification over single disciplines. The model's fuzzy accuracy reached 96% at N=3.
[Limitations] The study did not consider extracting richer discipline features from full text for automatic classification.
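The abstract does not give a formula for fuzzy accuracy. A minimal sketch follows, assuming it corresponds to top-N accuracy: a record counts as correct when its labelled discipline is among the N disciplines the model rates most probable. The function name `fuzzy_accuracy`, the toy probabilities, and the four-discipline setup are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def fuzzy_accuracy(
    probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    n: int = 3,
) -> float:
    """Share of records whose labelled discipline is among the n most probable ones.

    Assumption: this top-N reading of "fuzzy accuracy" is an interpretation
    of the abstract, not a definition quoted from the paper.
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not labels:
        raise ValueError("at least one record is required")
    hits = 0
    for row, gold in zip(probs, labels):
        # Indices of the n highest-probability disciplines for this record.
        top_n = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:n]
        hits += gold in top_n
    return hits / len(labels)


# Toy data: 4 records scored over 4 hypothetical disciplines.
toy_probs = [
    [0.60, 0.25, 0.10, 0.05],
    [0.10, 0.20, 0.30, 0.40],
    [0.35, 0.30, 0.25, 0.10],
    [0.05, 0.15, 0.50, 0.30],
]
toy_labels = [0, 1, 2, 0]

if __name__ == "__main__":
    for n in (1, 3):
        print(f"N={n}: fuzzy accuracy = {fuzzy_accuracy(toy_probs, toy_labels, n):.2f}")
```

On this toy data the score rises as N grows, which is the behaviour the metric is meant to capture: a record that sits between disciplines is credited when its labelled discipline appears anywhere in the top N rather than only in first place.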

Keywords: literature discipline classification; pre-trained language model; BERT; interdisciplinarity; Sentence-BERT

Classification codes: G353.1 [Cultural Science: Information Science]; C1 [Sociology]

 
