Hardware-Software Co-design Fault-tolerant Strategies for Systolic Array Accelerators


Authors: WEI Xiaohui[1], GUAN Zeyu, WANG Chenyang, YUE Hengshan, WU Qi[1,2]

Affiliations: [1] School of Computer Science and Technology, Jilin University, Changchun 130012, China; [2] High Performance Computing Center, Jilin University, Changchun 130012, China

Source: Computer Science (《计算机科学》), 2025, No. 5, pp. 91-100 (10 pages)

Funding: National Key Research and Development Program of China (2023YFB4502304); National Natural Science Foundation of China (62302190, 62272190).

Abstract: In recent years, with the continuous improvement in model inference accuracy, convolutional neural networks (CNNs) have been widely applied in safety-critical fields. To meet the demands of CNNs for real-time, high-performance, and low-power computing, CNN accelerators built on domain-specific architectures have emerged. Among these, systolic array architectures are widely used because of their simple structure and high parallelism. However, factors such as process variation and device aging make systolic arrays prone to Stuck-At faults (SAFs), which can in turn lead to catastrophic accidents. Fault-tolerant strategies for systolic arrays are therefore critically important. Existing fault-tolerant strategies, however, suffer from high time and resource overheads and from excessive modification of network parameters. To achieve an efficient, low-overhead, lightweight fault-tolerant strategy, this paper exploits the inherent fault tolerance of CNNs by relaxing the handling of SAFs that have little impact, thereby reducing overall fault-tolerance overhead. In addition, taking the computational characteristics of systolic arrays fully into account, it proposes two hardware-software co-design fault-tolerant strategies, row (column) swapping and weight splitting, which effectively mitigate the impact of SAFs on model inference accuracy. Experimental results show that, compared with the traditional row (column) bypass strategy and the selective protection strategy, the proposed hardware-software co-design strategies offer better execution efficiency and model accuracy recovery.
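The abstract's point that some SAFs matter much less than others can be illustrated with a minimal bit-level sketch. Everything here is an assumption made for illustration and is not taken from the paper: the 8-bit two's-complement weight format and the helper names `apply_stuck_at` and `to_signed` are hypothetical, and the code does not reproduce the paper's relaxation criterion, row (column) swapping, or weight splitting.

```python
def apply_stuck_at(value: int, bit: int, stuck_to: int, width: int = 8) -> int:
    """Return the raw `width`-bit pattern of `value` with one bit forced to 0 or 1.

    Hypothetical fault model: a single register bit is permanently stuck.
    """
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    if stuck_to not in (0, 1):
        raise ValueError("stuck_to must be 0 or 1")
    raw = value & ((1 << width) - 1)   # keep only the stored bits
    mask = 1 << bit
    return raw | mask if stuck_to else raw & ~mask


def to_signed(raw: int, width: int = 8) -> int:
    """Interpret a raw `width`-bit pattern as a two's-complement integer."""
    raw &= (1 << width) - 1
    return raw - (1 << width) if raw & (1 << (width - 1)) else raw


if __name__ == "__main__":
    weight = 3  # example weight, chosen arbitrarily
    for bit in range(8):
        faulty = to_signed(apply_stuck_at(weight, bit, stuck_to=1))
        print(f"stuck-at-1 on bit {bit}: {weight} -> {faulty}")
```

In this toy setting a stuck-at-1 fault on bit 0 leaves the weight 3 unchanged, because that bit is already 1, while the same fault on the sign bit turns it into -125. Differences of this kind are one plausible reason a fault-tolerance scheme might treat some SAFs as tolerable and protect only the rest.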

Keywords: convolutional neural network; fault-tolerant design; Stuck-At fault; systolic array; convolutional neural network accelerator

Classification: TP183 [Automation and Computer Technology: Control Theory and Control Engineering]

 
