How Was DeepSeek-R1 Created?


Author: ZHANG Huimin (张慧敏)

Affiliation: [1] Not specified

Source: Journal of Shenzhen University (Science and Engineering), 2025, Issue 2, pp. 226-232 (7 pages)

Abstract: This article summarizes the innovations and optimizations in the DeepSeek series of models for large-scale training. The breakthroughs of DeepSeek are primarily reflected in model architecture, algorithmic innovation, software-hardware collaborative optimization, and the improvement of overall training efficiency. DeepSeek-V3 adopts a mixture-of-experts (MoE) architecture, achieving efficient utilization of computing resources through fine-grained design and shared-expert strategies. The sparse activation mechanism and lossless load-balancing strategy in the MoE architecture significantly enhance the efficiency and performance of model training, especially when handling large-scale data and complex tasks. The multi-head latent attention (MLA) mechanism reduces memory usage and accelerates inference, thereby lowering training and inference costs. In DeepSeek-V3's training, the introduction of multi-token prediction (MTP) and 8-bit floating-point (FP8) mixed-precision training improves the model's contextual understanding and training efficiency, while optimized parallel thread execution (PTX) code significantly enhances the computational efficiency of graphics processing units (GPUs). In training the DeepSeek-R1-Zero model, group relative policy optimization (GRPO) is used for pure reinforcement learning, bypassing the traditional supervised fine-tuning and human-feedback stages and leading to a significant improvement in reasoning capabilities. Overall, the DeepSeek series models have achieved significant advantages in the field of artificial intelligence through multiple innovations, setting a new industry benchmark.
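For readers unfamiliar with the sparse MoE routing mentioned in the abstract, the following minimal Python sketch illustrates the top-k gating idea: only a few experts run per token, so most parameters stay inactive. The expert count, token dimension, and top-k value here are illustrative assumptions, not DeepSeek-V3's actual configuration.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_forward(token, experts, router_w, k=2):
    # Route one token through the top-k experts of a sparse MoE layer.
    # Only k experts execute per token (sparse activation); all names
    # and sizes here are illustrative, not DeepSeek-V3's real setup.
    scores = softmax(router_w @ token)           # router affinity per expert
    topk = np.argsort(scores)[-k:]               # indices of the k best experts
    weights = scores[topk] / scores[topk].sum()  # renormalize gate weights
    # Weighted sum of the selected experts' outputs
    return sum(w * experts[i](token) for w, i in zip(weights, topk))

# Toy usage: 4 linear "experts" on an 8-dim token, top-2 routing
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(8, 8)): W @ x for _ in range(4)]
router_w = rng.normal(size=(4, 8))
token = rng.normal(size=8)
print(moe_forward(token, experts, router_w).shape)  # (8,)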

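The GRPO step named in the abstract can be sketched similarly: for each prompt, a group of responses is sampled, and each response's reward is standardized against the group's mean and standard deviation, which removes the need for a separate learned critic (value) model. The reward values below are toy inputs for illustration.

import numpy as np

def grpo_advantages(rewards):
    # Group-relative advantages in the GRPO style: each response's
    # advantage is its reward standardized against the group's mean
    # and standard deviation, so no critic model is needed.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards zero variance

# Toy usage: 4 sampled responses to one prompt, scored by a rule-based reward
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # positive for correct answers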
Keywords: artificial intelligence; DeepSeek; large language model; mixture-of-experts model; multi-head latent attention; multi-token prediction; mixed-precision training; group relative policy optimization

Classification: TP18 [Automation and Computer Technology - Control Theory and Control Engineering]
