Data-Driven Academic Publishing Large Model: An Empirical Test of the Centrality of Large-Scale and High-Quality Data in Building High-Performance Models

Cited by: 1


Authors: 薛德军 (XUE Dejun); 师庆辉 (SHI Qinghui); 毕琰虹 (BI Yanhong); 芦筱菲 (LU Xiaofei); 陈婧 (CHEN Jing); 王旭 (WANG Xu); 王海山 (WANG Haishan); 耿崇 (GENG Chong); 吴晨 (WU Chen)

Affiliation: [1] Tongfang Knowledge Network Digital Publishing Technology Co., Ltd., Beijing 100192, China

Source: Digital Publishing Research (《数字出版研究》), 2024, No. 3, pp. 122-132 (11 pages)

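Funding: National Key Research and Development Program of China, "Research on Key Technologies of Knowledge Fusion and Intelligent Interaction for Case-Oriented Legal Supervision by Procuratorial Organs" (Grant No. 2020YFC0833003); National Excellence Action Plan, "International Platform Service Project for the Digital Operation of Science and Technology Journals" (Grant No. WKZB1911BJM501173/02).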

Abstract: Large-scale, high-quality data is essential to building high-performing large models. This paper examines that core element and systematically evaluates its practical effect and potential value in a professional domain. Based on a large body of professional literature from China National Knowledge Infrastructure (CNKI), the authors constructed an academic resource dataset, AcaDS, containing 131.645 billion tokens, and a downstream fine-tuning dataset, AcaDSI, containing 27 million instructions. Using the Transformer architecture, they designed and trained AcaLM-7B, a generative academic large model with 7 billion parameters. In an experimental evaluation across six core application scenarios for academic research, AcaLM-7B ranked first in total score, first in three individual scenarios, and second in two, which verifies the central role of large-scale, high-quality data resources in building professional large models. The work also has practical value for the digital publishing industry, where it can help improve content production efficiency and optimize user experience.

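Keywords: high-quality data; academic large model; publishing large model; CNKI large model; professional application scenarios; model evaluation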

Classification code: G230.7 [Culture and Science]

 
