Source: Chinese Journal of Computers (《计算机学报》), 2018, No. 10, pp. 2251-2264 (14 pages)
Funding: Supported by the National Natural Science Foundation of China (61572025; 61472432)
Abstract: Dense matrix multiplication is a core computation in many algorithms for large-scale scientific computing. This paper proposes an efficient vectorization method for matrix multiplication on multi-core vector processors. A row-wise vectorization method is proposed whose basic idea is to compute all elements of one row of matrix C at the same time. Row i of C is obtained through k vector multiply-accumulate operations: in each one, the j-th element of row i of A is expanded into a vector of identical values and then multiplied and accumulated with the row-j vector of B. Each vector multiply-accumulate runs in parallel across the VPEs. Both source and result data are kept in the VPEs' local registers, and all multiply-accumulate operations that contribute to a given result are performed on the same VPE. The data of A, B and C are all read sequentially by rows, so memory access is efficient, and the elements of row i of C are all complete when the k loop ends. The method fully exploits the vector processor's ability to load data cooperatively through its scalar and vector units, effectively reduces the bandwidth demand on DDR, avoids both inefficient access to column vectors of the multiplier matrix and floating-point reduction sums across VPEs, and achieves optimal kernel performance. The processor's level-1 data cache and array memory are configured in SRAM access mode, which avoids the access latency caused by cache misses and improves the efficiency with which the core computation accesses them. Multicast DMA is used to transfer matrix data, which significantly improves the efficiency of reading matrix data from DDR. A method is proposed for designing optimal blocking parameters for the core sub-block matrices based on architectural features of the vector processor, such as the number of vector processing elements (VPEs), the number of FMAC units per VPE, the capacity of the vector memory, and the data type of the matrix elements. This exploits several levels of parallelism: data parallelism across cores, vector SIMD parallelism across the VPEs within a core, parallelism across the FMAC units within a VPE, and scalar and vector instruction-level parallelism within a VPE. Loops are fully unrolled according to the FMAC instruction delay slots so that the kernel always runs at peak speed. A two-level DMA double-buffering data movement strategy is proposed to optimize and smooth data transfers across the multi-level memory hierarchy, so that DMA data-movement time fully overlaps with the kernel's computation…
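The row-wise scheme described in the abstract can be illustrated with a small scalar sketch. This is not the authors' kernel: the function name `matmul_row_broadcast` and the use of plain Python lists are assumptions made for illustration, and each list position stands in for one vector lane (one VPE). It shows only the access pattern: one element of A is broadcast, one row of B is read contiguously, and every lane accumulates its own element of C, so no reduction across lanes is needed.

```python
def matmul_row_broadcast(A, B):
    """Compute C = A * B one row of C at a time (illustrative sketch only).

    A is m x k and B is k x n, both given as lists of rows.
    Each position in `acc` plays the role of one vector lane.
    """
    k = len(B)
    n = len(B[0])
    assert all(len(row) == k for row in A), "inner dimensions must match"

    C = []
    for a_row in A:                      # row i of A produces row i of C
        acc = [0.0] * n                  # one accumulator per lane (column of C)
        for j in range(k):               # k multiply-accumulate steps
            a_ij = a_row[j]              # scalar element, conceptually broadcast
            b_row = B[j]                 # row j of B, read contiguously
            # Lane-wise multiply-accumulate: lane c only ever touches acc[c],
            # so no sum across lanes is required.
            acc = [c + a_ij * b for c, b in zip(acc, b_row)]
        C.append(acc)                    # row i of C is complete after the k loop
    return C


A = [[1.0, 2.0],
     [3.0, 4.0]]
B = [[5.0, 6.0, 7.0],
     [8.0, 9.0, 10.0]]
C = matmul_row_broadcast(A, B)
```

In contrast to the textbook inner-product formulation, which reads a column of B and sums partial products across lanes, this ordering reads all three matrices by rows, which is the property the abstract credits for the improved memory access efficiency.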
Keywords: multi-core vector processor; high-performance computing; matrix multiplication; block matrix; vectorization
Classification No.: TP391 [Automation and Computer Technology: Computer Application Technology]