Generative and Hierarchical Classification of Literature Based on Fine-tuned Large Language Models


Authors: Hu Zhongyi; Shui Diancheng; Wu Jiang [1,2]

Affiliations: [1] School of Information Management, Wuhan University, Wuhan 430072; [2] Center for E-commerce Research and Development, Wuhan University, Wuhan 430072

Source: 《情报学报》 (Journal of the China Society for Scientific and Technical Information), 2025, Issue 4, pp. 425-437 (13 pages)

Funding: Key Program of the National Natural Science Foundation of China, "Research on Digital-Intelligence Empowerment of the Rural Industrial Internet from a Network Perspective" (72232006); National Natural Science Foundation of China project, "Research on the Functional Positioning, Operating Mechanisms, and Governance Mechanisms of Data Trading Venues" (72442030)

Abstract: The automatic classification and indexing of literature facilitates its organized storage, arrangement, and retrieval. Previous studies have primarily used discriminative models to identify the shallow categories of literature automatically, but these models struggle with deep-level categorization and accuracy. This study therefore transforms the hierarchical classification of literature into a task of generating hierarchical category labels, and proposes a generative hierarchical classification indexing framework based on fine-tuned large language models (LLMs). The framework first interprets the hierarchical classification numbers of literature as natural-language labels; it then applies parameter-efficient fine-tuning techniques to perform supervised fine-tuning of open-source LLMs; finally, the fine-tuned LLM directly generates multi-level classification labels for literature, which are mapped back to Chinese Library Classification numbers. Experiments on data from three disciplines (economics, medicine and health, and industrial technology) show that supervised fine-tuning effectively improves the understanding and reasoning abilities of general-purpose LLMs on the hierarchical classification indexing task, and that the fine-tuned models achieve better classification performance than traditional discriminative models. Integrating the abstracts, titles, and keywords of literature further improves the classification performance of fine-tuned LLMs. A comparison of Baichuan2 and Qwen1.5 models of different parameter sizes shows that the fine-tuned Qwen1.5-14B-Chat model performs best, achieving 98% classification performance on first-level categories and 80% accuracy on the most challenging fifth-level categories. An analysis of typical examples demonstrates that the fine-tuned Qwen1.5-14B-Chat also exhibits some error-correction capability.
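The final step the abstract describes, mapping the generated natural-language label path back to a Chinese Library Classification (CLC) number, can be sketched roughly as follows. The label strings, the lookup table, and the function name are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of the label-mapping step: the fine-tuned LLM emits a
# path of natural-language category labels, and a lookup table converts the
# deepest recognizable prefix of that path back to a CLC number.
# All labels and codes below are illustrative examples, not from the paper.

LABEL_TO_CLC = {
    ("Economics",): "F",
    ("Economics", "World economy"): "F1",
    ("Industrial technology",): "T",
    ("Industrial technology", "Automation technology, computer technology"): "TP",
}

def map_labels_to_clc(label_path):
    """Return the CLC number for the longest matching prefix of the label path."""
    for depth in range(len(label_path), 0, -1):
        code = LABEL_TO_CLC.get(tuple(label_path[:depth]))
        if code is not None:
            return code
    return None  # no level of the generated path matched the table

print(map_labels_to_clc(["Economics", "World economy"]))  # -> F1
```

Falling back to the longest matching prefix mirrors the hierarchical setting: even if the model's deepest generated label is unmappable, the shallower levels can still yield a usable (coarser) classification number.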

Keywords: large language models; literature classification indexing; hierarchical classification; Chinese Library Classification (《中国图书馆分类法》)

CLC number: G254.3 [Culture and Science: Library Science]

 
