A Large-Dimension Matrix Chain Multiplier with Extremely Low IO Bandwidth Requirements


Authors: Song Yukun[1], Zheng Qiangqiang, Wang Zezhong, Zhang Duoli[1] (School of Electronic Science and Applied Physics, Hefei University of Technology, Hefei 230009, China)

Affiliation: [1] School of Electronic Science and Applied Physics, Hefei University of Technology

Source: Application of Electronic Technique (《电子技术应用》), 2019, No. 9, pp. 32-38 (7 pages)

Funding: National Natural Science Foundation of China (61106020)

Abstract: Large-dimension matrix multiplication is usually implemented with the submatrix blocking method, in which the maximum submatrix size determines the execution speed of the whole multiplication. Because the matrix size that a classic systolic structure can process directly is severely limited by IO bandwidth, this paper proposes a large-dimension matrix chain multiplier structure with extremely low IO bandwidth requirements, and completes its hardware design, implementation, and performance verification. The main contributions are as follows: (1) the data organization of matrix multiplication is optimized so that the input matrix size is independent of IO bandwidth, allowing the internal logic and storage resources of the device to be used to the fullest; (2) chain multiplier hardware is designed around the optimized data organization, so that the computation and the transfer of source data overlap; (3) the multiplier's adaptability to matrix size is enhanced: the chain multiplier can be configured in real time as multiple independent chains that perform several groups of operations in parallel; (4) chain multipliers of several sizes are implemented and tested on a Xilinx C7V2000T FPGA. On this chip the proposed chain multiplier supports up to 800 arithmetic units, 8 times the size of the classic systolic structure; with the same number of arithmetic units, it achieves equal performance while using only 1/8 of the IO bandwidth of the classic systolic structure.
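The submatrix blocking method mentioned in the abstract can be illustrated in software. The following is a minimal Python sketch, not the authors' hardware design: the function names `matmul` and `blocked_matmul` and the `block` parameter are assumptions made for illustration, with `block` standing in for the largest submatrix the multiplier can process directly. It only shows why a larger directly supported submatrix means fewer block-level products for the same overall multiplication.

```python
def matmul(a, b):
    """Plain triple-loop product, used as the reference result."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def blocked_matmul(a, b, block):
    """Multiply a (n x k) by b (k x m) one block x block tile at a time.

    `block` is a hypothetical stand-in for the largest submatrix size the
    hardware handles directly. Returns the product and the number of tile
    products that were needed.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    tile_products = 0
    for i0 in range(0, n, block):          # tile row of the result
        for j0 in range(0, m, block):      # tile column of the result
            for p0 in range(0, k, block):  # tiles along the shared dimension
                tile_products += 1
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        acc = 0
                        for p in range(p0, min(p0 + block, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] += acc     # accumulate partial tile result
    return c, tile_products


if __name__ == "__main__":
    a = [[i + j for j in range(8)] for i in range(8)]
    b = [[i * j + 1 for j in range(8)] for i in range(8)]
    for block in (2, 4, 8):
        c, tiles = blocked_matmul(a, b, block)
        assert c == matmul(a, b)
        print(f"block={block}: {tiles} tile products")
```

For an 8 x 8 product, doubling the tile size cuts the number of tile products by a factor of eight in this sketch, which is the software analogue of the abstract's point that the maximum submatrix size governs overall speed. How the chain structure decouples that size from IO bandwidth is specific to the paper and is not modeled here.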

Keywords: matrix multiplication; systolic; chain; IO bandwidth; FPGA

Classification number: TN47 [Electronics and Telecommunications: Microelectronics and Solid-State Electronics]

 
