Single-precision Matrix Multiplication Algorithm Toward SW26010 Many-core Processor


Authors: WU Zheng; XU Le; AN Hong[1]; JIN Xu[1]; WEN Ke (School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China)

Affiliation: [1] School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China

Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), 2023, No. 4, pp. 673-681 (9 pages)

Funding: Supported by the National Key Research and Development Project of China (2018YFB0204102).

Abstract: Matrix multiplication is a critical, frequently used component of many scientific applications. Because it is compute-intensive and dense by nature, parallel matrix multiplication has long been an active research topic in high-performance computing. With the rapid adoption of SW26010, the Sunway many-core processor developed in China, in scientific computing and artificial intelligence, there is an urgent demand for high-performance matrix multiplication algorithms targeting this processor. Based on the architectural features of SW26010, this paper presents the first in-depth study of single-precision matrix multiplication on the processor and proposes three high-performance parallel algorithms targeting different levels of the storage hierarchy. The design addresses both computation and memory access. For computation, the instruction sequence of the core computing task is scheduled by hand at the assembly level to exploit the two pipelines of the CPE (computing processing element), which ensures efficient instruction-level parallelism. For memory access, the design makes effective use of the limited on-chip storage and overlaps memory-access tasks with computing tasks, which balances computation against memory access and improves overall performance. Experiments show that, compared with the single-precision matrix multiplication in xMath, the official and state-of-the-art math library for this processor, this work improves runtime peak performance by 6.8% and reaches 86.17% of the theoretical peak performance. In a generality comparison across different matrix multiplication scenarios, it outperforms the official implementation in 95.33% of the cases; the best performance improvement is 247.9% and the average improvement is 61.66%. These results demonstrate that the work offers both broad generality and considerable performance improvement.
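The abstract's point about using limited on-chip storage effectively can be illustrated with a generic tiled (blocked) matrix multiplication. The sketch below is not the paper's algorithm and is not SW26010 code: it is plain Python, the function names `matmul_naive` and `matmul_blocked` and the tile sizes `bm`, `bn`, `bk` are hypothetical, and it does not model hand-scheduled assembly, the dual pipelines, or the overlap of memory access with computation. It only shows how a product can be computed from small tiles that would fit in a bounded local buffer.

```python
# Illustrative sketch only (assumed names and tile sizes, not from the paper).

def matmul_naive(a, b):
    """Reference triple loop: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


def matmul_blocked(a, b, bm=2, bn=2, bk=2):
    """Tiled product: each step touches one bm x bk tile of A and one
    bk x bn tile of B, standing in for data staged in a small local buffer."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, bm):
        for j0 in range(0, n, bn):
            for p0 in range(0, k, bk):
                # Copy the current tiles; slicing handles ragged edge tiles.
                a_tile = [row[p0:p0 + bk] for row in a[i0:i0 + bm]]
                b_tile = [row[j0:j0 + bn] for row in b[p0:p0 + bk]]
                # Accumulate the tile product into the matching block of C.
                for i, a_row in enumerate(a_tile):
                    c_row = c[i0 + i]
                    for p, a_val in enumerate(a_row):
                        b_row = b_tile[p]
                        for j, b_val in enumerate(b_row):
                            c_row[j0 + j] += a_val * b_val
    return c
```

Calling `matmul_blocked(a, b)` on two list-of-lists matrices returns the same product as `matmul_naive(a, b)` when the entries are small integers stored as floats; with general floating-point data the two may differ slightly because the summation order changes. In a real implementation the tile sizes would be chosen from the actual buffer capacity, which this sketch does not attempt to model.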

Keywords: many-core processor; matrix multiplication; computer architecture; high-performance computing; parallel algorithm

Classification: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
