Authors: CHEN Rui (陈锐), SUN Yu-Fei (孙羽菲), CHENG Da-Guo (程大果), GUO Qiang (郭强), CHEN Yu-Qiao (陈禹乔), SHI Chang-Qing (石昌青), SUI Yi-Cheng (隋轶丞), ZHANG Yu-Zhe (张宇哲), ZHANG Yu-Zhi (张玉志) (College of Software, Nankai University, Tianjin 300450)
Affiliation: [1] College of Software, Nankai University, Tianjin 300450
Source: Chinese Journal of Computers (《计算机学报》), 2022, No. 11, pp. 2456-2474 (19 pages)
Fund: Supported by the National Key Research and Development Program of China (2021YFB0300104).
Abstract: Heterogeneous computing technology is now widely used in artificial intelligence, with GPGPU-based parallel accelerators and CPUs working together to complete large-scale parallel computing tasks more efficiently. Deep learning models cannot be built, trained, or used for inference without the support of machine learning frameworks, yet today's mainstream frameworks essentially support only the proprietary, closed CUDA heterogeneous programming model, which makes them rely heavily on NVIDIA GPGPUs. Hardware accelerators from many other vendors, especially domestic Chinese ones, therefore struggle to realize their full potential in deep learning. Replacing the proprietary CUDA programming model with the open, unified heterogeneous programming standard OpenCL, and porting mainstream machine learning frameworks to domestic hardware accelerators, is of great importance for breaking foreign technical barriers around machine learning frameworks and hardware accelerators, promoting the use of domestic accelerators in the new generation of artificial intelligence, and building a software ecosystem based on domestic acceleration chips. Since implementing OpenCL kernels is the core, fundamental work in adding an OpenCL backend to TensorFlow, this paper proposes a code conversion scheme from CUDA kernels to OpenCL kernels in TensorFlow. Specifically, it summarizes the basic rules of CUDA-to-OpenCL kernel conversion, the typical difficulties encountered during conversion and their solutions, and a series of techniques for optimizing the performance of the converted OpenCL kernels; it also presents a method for integrating calls to the OpenCL kernels into TensorFlow. To the best of our knowledge, this is the first work to implement the 135 OpenCL kernels of TensorFlow version 2.2. Extensive experiments show that the 135 converted OpenCL kernels run correctly on a variety of accelerators that support the OpenCL standard, and that after optimization nearly 80% of them achieve computational performance comparable to the original CUDA kernels on an NVIDIA Tesla V100S. The results verify the generality and effectiveness of the proposed CUDA-to-OpenCL kernel conversion scheme: a TensorFlow build containing the OpenCL kernels can directly target accelerators from different vendors while maintaining good computational performance.
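The conversion rules themselves are not reproduced in this record, but the kind of mechanical rewriting the abstract refers to can be illustrated with a minimal, hypothetical example (not taken from the paper): CUDA's __global__ qualifier becomes OpenCL's __kernel, the index arithmetic threadIdx.x + blockIdx.x * blockDim.x maps to get_global_id(0), and buffer arguments gain an explicit address-space qualifier such as __global. A sketch for a simple AXPY kernel, under these assumptions:

    /* CUDA version (illustrative only, not a kernel from the paper). */
    __global__ void AxpyCuda(int n, float a, const float* x, float* y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
      if (i < n) {
        y[i] = a * x[i] + y[i];
      }
    }

    /* Hand-converted OpenCL C version: __global__ -> __kernel, explicit
     * address spaces on pointer arguments, and get_global_id(0) in place
     * of the CUDA index arithmetic. */
    __kernel void AxpyOpenCL(int n, float a,
                             __global const float* x, __global float* y) {
      int i = get_global_id(0);  /* global work-item index, dimension 0 */
      if (i < n) {
        y[i] = a * x[i] + y[i];
      }
    }

Real TensorFlow kernels are considerably more involved (shared memory, warp-level primitives, templated C++ host code), which is presumably where the typical difficulties and optimizations discussed in the paper arise.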
Keywords: hardware accelerator; heterogeneous programming environment; CUDA; OpenCL; TensorFlow
Classification Code: TP312 [Automation and Computer Technology / Computer Software and Theory]