Authors: GONG Mingqing (龚鸣清); YE Huang (叶煌) [1]; ZHANG Jian (张鉴) [1]; LU Xingjing (卢兴敬); CHEN Wei (陈伟)
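Affiliations: [1] Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China; [2] University of Chinese Academy of Sciences, Beijing 100049, China; [3] Beijing Sogou Technology Development Company Limited, Beijing 100084, China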
Source: Journal of Computer Applications (《计算机应用》), 2019, No. 6, pp. 1557-1562 (6 pages)
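Funding: National Key Research and Development Program of China (2016YFB0201100, 2017YFB0202803); National Natural Science Foundation of China (11871454, 91630204, 61531166003); Strategic Priority Research Program of the Chinese Academy of Sciences, Category B (XDB22020102); Informatization Special Project of the Chinese Academy of Sciences (XXH13506-204)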
Abstract: To address the low efficiency of neural network inference on mobile intelligent devices that use ARM processors, an optimization scheme for the single-precision floating-point general matrix multiply (SGEMM) algorithm on the ARMv8 architecture is proposed. First, the computational efficiency of SGEMM on ARMv8 processors is shown to be limited by three factors: how the vector computation units are used, the instruction pipeline, and the probability of cache misses. Second, three optimization techniques are implemented, one for each limiting factor: vector instructions written in inline assembly, data rearrangement, and data prefetching. Finally, test experiments are designed around three matrix patterns that are common in speech-related neural networks, and the programs are run on the RK3399 hardware platform. The experimental results show a single-core computing speed of 10.23 GFLOPS in square-matrix mode, reaching 78.2% of the measured floating-point peak; 6.35 GFLOPS in slender-matrix mode, reaching 48.1% of the measured peak; and 2.53 GFLOPS in continuous-small-matrix mode, reaching 19.2% of the measured peak. After the optimized SGEMM algorithm was deployed in a speech recognition neural network program, the program's actual speech recognition speed improved significantly.
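The abstract names data rearrangement and data prefetching as two of the three optimizations. The sketch below illustrates what those two ideas can look like in plain, portable C; it is not the authors' code, which the abstract says uses ARMv8 vector instructions in inline assembly. Everything here is an assumption made for illustration: the 4x4 tile size, the prefetch distance, the row-major layout, and the hypothetical names `pack_a`, `pack_b`, `micro_kernel`, and `sgemm_packed`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical tile size and prefetch distance; not taken from the paper. */
enum { MR = 4, NR = 4, PF = 8 };

/* Data rearrangement: copy an MR x K panel of row-major A so that the
 * MR values needed at each step k sit next to each other in memory. */
static void pack_a(const float *A, size_t lda, size_t K, float *dst) {
    for (size_t k = 0; k < K; ++k)
        for (size_t i = 0; i < MR; ++i)
            *dst++ = A[i * lda + k];
}

/* Same idea for a K x NR panel of row-major B. */
static void pack_b(const float *B, size_t ldb, size_t K, float *dst) {
    for (size_t k = 0; k < K; ++k)
        for (size_t j = 0; j < NR; ++j)
            *dst++ = B[k * ldb + j];
}

/* C tile (MR x NR) += packed A panel * packed B panel.
 * The accumulators stay in a small local array that a compiler can keep
 * in registers; a tuned kernel would use vector instructions instead. */
static void micro_kernel(size_t K, const float *pa, const float *pb,
                         float *C, size_t ldc) {
    float acc[MR][NR] = {{0.0f}};
    for (size_t k = 0; k < K; ++k) {
#if defined(__GNUC__)
        /* Data prefetching: hint the cache about data PF steps ahead. */
        if (k + PF < K) {
            __builtin_prefetch(pa + PF * MR);
            __builtin_prefetch(pb + PF * NR);
        }
#endif
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                acc[i][j] += pa[i] * pb[j];
        pa += MR;
        pb += NR;
    }
    for (size_t i = 0; i < MR; ++i)
        for (size_t j = 0; j < NR; ++j)
            C[i * ldc + j] += acc[i][j];
}

/* C (M x N) += A (M x K) * B (K x N), all row-major and densely stored.
 * Simplification: M and N must be multiples of the tile size, and the
 * A panel is repacked for every column block. Returns 0 on success,
 * -1 if the packing buffers cannot be allocated. */
int sgemm_packed(size_t M, size_t N, size_t K,
                 const float *A, const float *B, float *C) {
    assert(M % MR == 0 && N % NR == 0);
    if (K == 0) return 0;
    float *pa = malloc(MR * K * sizeof *pa);
    float *pb = malloc(NR * K * sizeof *pb);
    if (!pa || !pb) { free(pa); free(pb); return -1; }
    for (size_t j = 0; j < N; j += NR) {
        pack_b(B + j, N, K, pb);
        for (size_t i = 0; i < M; i += MR) {
            pack_a(A + i * K, K, K, pa);
            micro_kernel(K, pa, pb, C + i * N + j, N);
        }
    }
    free(pa);
    free(pb);
    return 0;
}
```

Calling `sgemm_packed(M, N, K, A, B, C)` accumulates the product into `C`. The point of the packing step is that the inner loop then reads both operands with unit stride, which is the property that makes vector loads and prefetch hints effective.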
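Keywords: ARMv8; Single Instruction Multiple Data (SIMD) computing; Basic Linear Algebra Subprograms (BLAS) library; high-performance computing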
Classification number: TP332 [Automation and Computer Technology: Computer System Architecture]