Authors: CHEN Rui (陈锐), SUN Yu-Fei (孙羽菲), CHENG Da-Guo (程大果), GUO Qiang (郭强), CHEN Yu-Qiao (陈禹乔), SHI Chang-Qing (石昌青), SUI Yi-Cheng (隋轶丞), ZHANG Yu-Zhe (张宇哲), ZHANG Yu-Zhi (张玉志) (College of Software, Nankai University, Tianjin 300450)
Affiliation: [1] College of Software, Nankai University, Tianjin 300450
Source: Chinese Journal of Computers (《计算机学报》), 2022, No. 11, pp. 2456-2474 (19 pages)
Fund: Supported by the National Key Research and Development Program of China (2021YFB0300104).
Abstract: Heterogeneous computing technology is now widely used in artificial intelligence, with GPGPU-based parallel accelerators and CPUs working together to complete large-scale parallel computing tasks more efficiently. Deep learning models cannot be built, trained, or used for inference without the support of machine learning frameworks, yet today's mainstream frameworks essentially support only the proprietary, closed CUDA heterogeneous programming model, which makes them rely heavily on NVIDIA GPGPUs. Hardware accelerators from many other vendors, especially domestic Chinese ones, therefore struggle to realize their full potential in deep learning. Replacing the proprietary CUDA programming model with the open, unified heterogeneous programming standard OpenCL, and porting mainstream machine learning frameworks to domestic hardware accelerators, is of great importance for breaking foreign technical barriers around machine learning frameworks and hardware accelerators, promoting the use of domestic accelerators in the new generation of artificial intelligence, and building a software ecosystem based on domestic acceleration chips. Since implementing OpenCL kernels is the core, fundamental work in adding an OpenCL backend to TensorFlow, this paper proposes a code conversion scheme from CUDA kernels to OpenCL kernels in TensorFlow. Specifically, it summarizes the basic rules of CUDA-to-OpenCL kernel conversion, the typical difficulties encountered during conversion and their solutions, and a series of techniques for optimizing the performance of the converted OpenCL kernels; it also presents a method for integrating calls to the OpenCL kernels into TensorFlow. To the best of our knowledge, this is the first work to implement the 135 OpenCL kernels of TensorFlow version 2.2. Extensive experiments show that the 135 converted OpenCL kernels run correctly on a variety of accelerators that support the OpenCL standard, and that after optimization nearly 80% of them achieve computational performance comparable to the original CUDA kernels on an NVIDIA Tesla V100S. The results verify the generality and effectiveness of the proposed CUDA-to-OpenCL kernel conversion scheme: a TensorFlow build containing the OpenCL kernels can directly target accelerators from different vendors while maintaining good computational performance.
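The conversion rules themselves are not reproduced in this record, but the kind of mechanical rewriting the abstract refers to can be illustrated with a minimal, hypothetical example (not taken from the paper): CUDA's __global__ qualifier becomes OpenCL's __kernel, the index arithmetic threadIdx.x + blockIdx.x * blockDim.x maps to get_global_id(0), and buffer arguments gain an explicit address-space qualifier such as __global. A sketch for a simple AXPY kernel, under these assumptions:

    /* CUDA version (illustrative only, not a kernel from the paper). */
    __global__ void AxpyCuda(int n, float a, const float* x, float* y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
      if (i < n) {
        y[i] = a * x[i] + y[i];
      }
    }

    /* Hand-converted OpenCL C version: __global__ -> __kernel, explicit
     * address spaces on pointer arguments, and get_global_id(0) in place
     * of the CUDA index arithmetic. */
    __kernel void AxpyOpenCL(int n, float a,
                             __global const float* x, __global float* y) {
      int i = get_global_id(0);  /* global work-item index, dimension 0 */
      if (i < n) {
        y[i] = a * x[i] + y[i];
      }
    }

Real TensorFlow kernels are considerably more involved (shared memory, warp-level primitives, templated C++ host code), which is presumably where the typical difficulties and optimizations discussed in the paper arise.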
Keywords: hardware accelerator; heterogeneous programming environment; CUDA; OpenCL; TensorFlow
Classification Code: TP312 [Automation and Computer Technology / Computer Software and Theory]