Optimization of Vector Function Library Based on Domestic PuDianNao Chip


Authors: YANG Zhizheng; DU Zidong; WEN Yuanbo (Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou 450001; State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; School of Computer, University of Science and Technology of China, Hefei 230026)

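Affiliations: [1] Henan Institute of Advanced Technology, Zhengzhou University, Zhengzhou, Henan 450001 [2] State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190 [3] School of Computer, University of Science and Technology of China, Hefei, Anhui 230026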

Source: Journal of Zhengzhou University (Engineering Science), 2023, No. 1, pp. 31-37 (7 pages)

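Funding: National Natural Science Foundation of China (61925208); Joint Funds of the National Natural Science Foundation of China (U19B2019); Strategic Priority Research Program of the Chinese Academy of Sciences (XDB32050200); Beijing Academy of Artificial Intelligence; Beijing Nova Program of Science and Technology (Z191100001119093).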

Abstract: At present, vector math functions on PuDianNao, a domestic artificial intelligence processor chip, can only be implemented by calling scalar functions in a loop, which performs poorly. Three optimization methods are proposed for the PuDianNao chip. The first is an interpolation method. The second is a SIMD method with masking. The third builds on PuDianNao's hardware array structure: VLIW instructions operate each processing unit in the array, a SIMT programming model is encapsulated on top of them, and two programming techniques, exposing branch ranges and branch flattening, are proposed. Accuracy and performance tests of the three methods show that the third has the best accuracy and performance. The third method is used to implement PuDianNao-VecMath, a vector math function library for the domestic PuDianNao chip, which solves the problem that the multi-branch structure of math functions is difficult to vectorize. The library has good accuracy and performance, is functionally stable, and runs correctly. Its interfaces cover common basic math library functions, including rounding functions, transcendental functions, comparison functions, and activation functions. For accuracy, all data in each function's domain interval are used as input, and the results are compared with those of the scalar functions running on an i7 CPU. The maximum ULP value is 2 for the single-precision version and 1 for the half-precision version. Performance improves substantially over scalar loops: the single-precision version has an average speed-up ratio of 18.26 and a maximum speed-up ratio of 35.90, and the half-precision version has an average speed-up ratio of 15.65 and a maximum speed-up ratio of 30.11.
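
The two ideas the abstract leans on, branch flattening with masks and accuracy measured in ULPs, can be illustrated with a minimal sketch. It is not the PuDianNao-VecMath implementation and does not model the chip's VLIW or SIMT machinery. The piecewise function (`hard_sigmoid`), the helper names (`select`, `ordered_bits`, `ulp_distance`), and the use of Python lists as stand-ins for vector lanes are all assumptions made for illustration, and the ULP definition used here (the count of representable single-precision values between two results) is one common convention that may differ from the one used in the paper.

```python
import struct


def ordered_bits(x: float) -> int:
    """Round x to IEEE 754 single precision and map it to an integer that grows with x."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    magnitude = bits & 0x7FFFFFFF
    # Negative floats get negative keys, so -0.0 and +0.0 both map to 0.
    return -magnitude if bits & 0x80000000 else magnitude


def ulp_distance(a: float, b: float) -> int:
    """Number of representable single-precision steps between a and b."""
    return abs(ordered_bits(a) - ordered_bits(b))


def hard_sigmoid_scalar(x: float) -> float:
    """Reference: ordinary branchy scalar code (hypothetical example function)."""
    if x <= -3.0:
        return 0.0
    if x >= 3.0:
        return 1.0
    return x / 6.0 + 0.5


def select(mask, if_true, if_false):
    """Lane-wise select; stands in for a hardware blend/select instruction."""
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]


def hard_sigmoid_flat(xs):
    """Flattened version: every lane evaluates every branch, masks pick the result."""
    low_mask = [x <= -3.0 for x in xs]
    high_mask = [x >= 3.0 for x in xs]
    middle = [x / 6.0 + 0.5 for x in xs]  # computed for all lanes, even unused ones
    zeros = [0.0] * len(xs)
    ones = [1.0] * len(xs)
    out = select(high_mask, ones, middle)
    return select(low_mask, zeros, out)


inputs = [-5.0, -3.0, -1.5, 0.0, 0.75, 3.0, 8.0]
flat = hard_sigmoid_flat(inputs)
reference = [hard_sigmoid_scalar(x) for x in inputs]
# Largest ULP gap between the flattened result and the scalar reference.
max_ulp = max(ulp_distance(a, b) for a, b in zip(flat, reference))
```

In this toy case both versions perform the same arithmetic, so `max_ulp` only confirms that flattening did not change the results. The comparison becomes informative when the vector path uses a different approximation from the scalar reference, which is the situation the accuracy figures in the abstract describe.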

Keywords: vectorized functions; PuDianNao-VecMath; domestic artificial intelligence processor; exposing branch ranges and branch flattening

Classification number: TP311 [Automation and Computer Technology: Computer Software and Theory]

 
