Single-precision floating-point general matrix multiply optimization for machine translation based on the ARMv8 architecture (cited by: 9)

Authors: GONG Mingqing (龚鸣清); YE Huang (叶煌) [1]; ZHANG Jian (张鉴) [1]; LU Xingjing (卢兴敬); CHEN Wei (陈伟)

Affiliations: [1] Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; [2] University of Chinese Academy of Sciences, Beijing 100049, China; [3] Beijing Sogou Technology Development Company Limited, Beijing 100084, China

Source: Journal of Computer Applications (《计算机应用》), 2019, No. 6, pp. 1557-1562 (6 pages)

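Funding: National Key Research and Development Program of China (2016YFB0201100, 2017YFB0202803); National Natural Science Foundation of China (11871454, 91630204, 61531166003); Strategic Priority Research Program of the Chinese Academy of Sciences, Category B (XDB22020102); Informatization Special Project of the Chinese Academy of Sciences (XXH13506-204)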

Abstract: To address the low efficiency of neural network inference on mobile smart devices that use ARM processors, a set of optimizations for the Single-precision floating-point GEneral Matrix Multiply (SGEMM) algorithm on the ARMv8 architecture is proposed. First, the computational efficiency of SGEMM on ARMv8 processors was found to be limited by three factors: how the vector computation units are used, the instruction pipeline, and the frequency of cache misses. Second, three optimization techniques were implemented to address these factors: inline assembly with vector instructions, data rearrangement, and data prefetching. Finally, test experiments were designed around three matrix patterns common in speech-oriented neural networks, and the programs were run on the RK3399 hardware platform. The experimental results show that the single-core computing speed is 10.23 GFLOPS in square matrix mode, reaching 78.2% of the measured floating-point peak; 6.35 GFLOPS in slender matrix mode, reaching 48.1% of the measured peak; and 2.53 GFLOPS in continuous small matrix mode, reaching 19.2% of the measured peak. With the optimized SGEMM algorithm deployed in a speech recognition neural network program, the program's actual speech recognition speed improved significantly.
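The abstract names data rearrangement and data prefetching without showing what they look like in code. The following is a minimal sketch in portable C, not the authors' implementation: the paper uses inline assembly with ARMv8 vector instructions, whereas this sketch uses a plain scalar register tile that a compiler may or may not vectorize. The tile sizes `MR` and `NR`, the prefetch distance, and all function names are hypothetical choices made for illustration. `__builtin_prefetch` is a GCC/Clang builtin and is skipped on other compilers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register-tile size; a real kernel would tune this per CPU. */
enum { MR = 4, NR = 4, PREFETCH_DIST = 8 };

/* Data rearrangement: copy an MR x k panel of row-major A so that the
 * MR values needed at step p are contiguous in memory. */
static void pack_a(const float *a, size_t lda, size_t k, float *dst) {
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < MR; ++i)
            *dst++ = a[i * lda + p];
}

/* Same idea for a k x NR panel of row-major B. */
static void pack_b(const float *b, size_t ldb, size_t k, float *dst) {
    for (size_t p = 0; p < k; ++p)
        for (size_t j = 0; j < NR; ++j)
            *dst++ = b[p * ldb + j];
}

/* Micro-kernel: C tile (MR x NR) += packed A panel * packed B panel. */
static void micro_kernel(size_t k, const float *pa, const float *pb,
                         float *c, size_t ldc) {
    float acc[MR][NR] = {{0.0f}};
    for (size_t p = 0; p < k; ++p) {
#if defined(__GNUC__)
        /* Data prefetching: hint at the panels a few steps ahead,
         * staying inside the packed buffers. */
        if (p + PREFETCH_DIST < k) {
            __builtin_prefetch(pa + PREFETCH_DIST * MR);
            __builtin_prefetch(pb + PREFETCH_DIST * NR);
        }
#endif
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                acc[i][j] += pa[i] * pb[j];
        pa += MR;
        pb += NR;
    }
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            c[i * ldc + j] += acc[i][j];
}

/* C (m x n) += A (m x k) * B (k x n), all row-major and densely stored.
 * pa and pb are caller-provided scratch buffers of MR*k and NR*k floats.
 * Assumes m and n are multiples of the tile size; edge tiles are omitted. */
void sgemm_sketch(size_t m, size_t n, size_t k,
                  const float *a, const float *b, float *c,
                  float *pa, float *pb) {
    assert(m % MR == 0 && n % NR == 0);
    for (size_t j = 0; j < n; j += NR) {
        pack_b(b + j, n, k, pb);
        for (size_t i = 0; i < m; i += MR) {
            pack_a(a + i * k, k, k, pa);
            micro_kernel(k, pa, pb, c + i * n + j, n);
        }
    }
}
```

Packing turns the strided accesses to A and B into sequential reads inside the inner loop, which is what makes a fixed prefetch distance meaningful. In a tuned kernel the accumulator tile would be held in vector registers and the packing cost would be amortized over larger cache blocks.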

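Keywords: ARMv8; Single Instruction Multiple Data (SIMD) computing; Basic Linear Algebra Subprograms (BLAS) library; high-performance computing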

Classification: TP332 [Automation and Computer Technology: Computer System Architecture]
