SMCA: A Framework for Scaling Chiplet-Based Computing-in-Memory Accelerators


Authors: LI Wen; WANG Ying [4,5]; HE Yintao; ZOU Kaiwei; LI Huawei; LI Xiaowei [4,5]

Affiliations: [1] School of Computer and Information Technology (School of Big Data), Shanxi University, Taiyuan 030006, China; [2] Institute of Big Data Science and Industry, Shanxi University, Taiyuan 030006, China; [3] Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China; [4] State Key Laboratory of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; [5] University of Chinese Academy of Sciences, Beijing 100190, China; [6] Department of Electronic Engineering, Tsinghua University, Beijing 100084, China

Source: Journal of Electronics & Information Technology, 2024, No. 11, pp. 4081-4091 (11 pages)

Funding: National Natural Science Foundation of China (62302283); Fundamental Research Program of Shanxi Province (Free Exploration Category) (202303021212015)

Abstract: Computing-in-Memory (CiM) architectures based on Resistive Random Access Memory (ReRAM) have been recognized as a promising solution for accelerating deep learning applications. As intelligent applications continue to evolve, deep learning models grow ever larger, placing higher demands on the computational and storage resources of processing platforms. However, due to the non-idealities of ReRAM devices, large-scale ReRAM-based computing chips face severe challenges of low yield and low reliability. Chiplet-based architectures assemble multiple small chiplets into a single package, improving fabrication yield and lowering manufacturing cost, and have become a primary trend in chip design. However, compared with on-chip data transfer in monolithic chips, the expensive inter-chiplet communication becomes the performance bottleneck of chiplet-based systems and limits their compute scalability. As a countermeasure, this paper proposes SMCA (SMT-based CiM chiplet Acceleration), a scaling framework for chiplet-based CiM accelerators. The framework combines adaptive partitioning of deep learning tasks with automated workload deployment based on Satisfiability Modulo Theories (SMT) to generate energy-efficient, low-transmission-overhead workload schedules on chiplet-based deep learning accelerators, effectively improving system performance and energy efficiency. Experimental results show that, compared with existing strategies, the scheduling optimization automatically generated by SMCA reduces inter-chiplet communication energy by 35%.
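
To make the SMT-based deployment idea concrete, below is a minimal sketch (not the paper's actual formulation) of how layer-to-chiplet assignment can be posed for an SMT/optimization solver such as z3: integer variables place each layer on a chiplet, capacity constraints bound the ReRAM arrays each chiplet can hold, and the objective minimizes the activation volume that crosses chiplet boundaries. All names and numbers here (NUM_CHIPLETS, CAPACITY, layer_cost, edge_traffic) are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of SMT-based layer-to-chiplet assignment (z3-solver).
# Goal: minimize inter-chiplet activation traffic under per-chiplet capacity.
from z3 import Int, Optimize, If, Sum, sat

NUM_CHIPLETS = 4
CAPACITY = 6                     # assumed crossbar-array budget per chiplet
layer_cost = [3, 2, 4, 2, 3, 1]  # assumed ReRAM arrays needed per DNN layer
edge_traffic = [5, 8, 6, 4, 2]   # assumed activation volume between layer i and i+1

opt = Optimize()

# place[i] = index of the chiplet that executes layer i
place = [Int(f"place_{i}") for i in range(len(layer_cost))]
for p in place:
    opt.add(p >= 0, p < NUM_CHIPLETS)

# Capacity constraint: arrays mapped onto each chiplet must fit its budget.
for c in range(NUM_CHIPLETS):
    opt.add(Sum([If(place[i] == c, layer_cost[i], 0)
                 for i in range(len(layer_cost))]) <= CAPACITY)

# Objective: activations that cross a chiplet boundary incur inter-chiplet traffic.
traffic = Sum([If(place[i] != place[i + 1], edge_traffic[i], 0)
               for i in range(len(edge_traffic))])
opt.minimize(traffic)

if opt.check() == sat:
    m = opt.model()
    print("placement:", [m[p].as_long() for p in place],
          "inter-chiplet traffic:", m.eval(traffic))
```

With z3-solver installed, the script prints one traffic-minimizing placement for this toy workload; the actual SMCA framework additionally performs adaptive task partitioning before deployment and targets energy rather than raw data volume.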

Keywords: chiplet; deep learning processor; computing-in-memory; task scheduling

CLC Numbers: TN40 [Electronics and Telecommunications: Microelectronics and Solid-State Electronics]; TP389.1 [Automation and Computer Technology: Computer System Architecture]

 
