Performance Evaluation of Chinese Universal Large Model in the Field of Humanities and Social Sciences

Cited by: 8

Authors: Zhao Zhixiao, Hu Die, Liu Chang [1,2]; Shen Si, Wang Dongbo [1,2] (College of Information Management, Nanjing Agricultural University, Nanjing 210095; Research Center of Humanities and Social Computing, Nanjing Agricultural University, Nanjing 210095; School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094)

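Affiliations: [1] College of Information Management, Nanjing Agricultural University, Nanjing 210095; [2] Research Center of Humanities and Social Computing, Nanjing Agricultural University, Nanjing 210095; [3] School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094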

Source: Library and Information Service, 2024, No. 13, pp. 132-143 (12 pages)

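Funding: This paper is one of the research outcomes of the Jiangsu Provincial Social Science Fund post-funding project "Research on the Construction and Application of Large Language Models for the Humanities and Social Sciences" (Project No. 23HQBO63).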

Abstract: [Purpose/Significance] Taking the humanities and social sciences as its starting point, this paper compares the performance of large models in this field along two dimensions: basic domain knowledge and academic texts. It aims to provide a systematic large language model evaluation benchmark for the humanities and social sciences, as a reference for researchers in related fields. [Method/Process] Seven evaluation tasks related to the humanities and social sciences were designed, each with corresponding metrics. On this basis, open-source general-purpose Chinese large language models with strong current performance were selected. The models were invoked locally to complete the domain-specific tasks in question-and-answer form, and their performance in the humanities and social sciences was quantified with the selected metrics. [Result/Conclusion] Among the open-source models selected, for both base models and dialog models, Qwen performs best, followed closely by Baichuan2, then InternLM, with Atom performing worst. Moreover, in most cases the dialog models outperform the base models.

Keywords: humanities and social sciences; large model evaluation; domain knowledge; academic text

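Classification: C1 [Sociology]; TP18 [Automation and Computer Technology: Control Theory and Control Engineering]; TP391.1 [Automation and Computer Technology: Control Science and Engineering]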
