Author affiliations: [1] College of Computer and Control Engineering, Nankai University, Tianjin 300071, China; [2] State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100109, China
Source: Chinese Journal of Computers, 2018, Issue 10, pp. 2175-2192 (18 pages)
Funding: Supported by the National Natural Science Foundation of China (61872200); the Natural Science Foundation of Tianjin (16JCYBJC15200, 17JCQNJC00300); the Open Project of the State Key Laboratory of Computer Architecture (CARCH201504); the Tianjin Science and Technology Major Special Project on Big Data and Cloud Computing (15ZXDSGX00020); and the Specialized Research Fund for the Doctoral Program of Higher Education (20130031120029)
Abstract: GPUs have become general-purpose co-processors with high concurrency and high memory bandwidth. However, because GPUs and CPUs differ greatly in both architecture and programming model, programming CPU-GPU heterogeneous systems is difficult and time-consuming. CUDA (Compute Unified Device Architecture), the general-purpose parallel computing platform and programming model introduced by NVIDIA, enables thousands of threads on NVIDIA GPUs to be harnessed for high-performance computing and eases, to some extent, the use of GPU parallelism. Nevertheless, even with CUDA's concurrent-kernel and multi-stream techniques, it remains hard to fully control and utilize GPU computational resources and to reasonably schedule the computational tasks running on them; this is the main bottleneck in applying GPUs to practical workloads such as matrix operations from linear algebra and machine learning algorithms, and especially to irregular parallel applications. Building on hardware features of modern GPU architectures, this paper proposes CAGTP (CPU-Assisted GPU Thread Pool), a thread-pool-based GPU task parallel computing model in which the CPU assists with task scheduling, enabling shared-memory-style programming on CPU-GPU heterogeneous systems. First, CAGTP exploits page-locked memory and the unified virtual address space, supported by recent GPU architectures and CUDA versions, to improve CPU-GPU communication efficiency. Second, on the CPU side we design I/O task queues, a thread-block-level task scheduler, and task slots, which allow users to dynamically schedule the tasks to be computed on the GPU. On the GPU side we design the task-multiplexed kernel, the core of CAGTP, which achieves dynamic scheduling of thread blocks. Based on these mechanisms, CAGTP supports efficient fine-grained task interaction between CPU and GPU, avoids the overhead of repeatedly launching and stopping kernels in native CUDA programs, and effectively supports fine-grained irregular parallel task computation on GPUs. Finally, the model's API functions reduce the complexity and time consumption of programming CPU-GPU heterogeneous systems. Experimental results show that the task-scheduling overhead in CAGTP is only 5% of a kernel launch, and that the model improves the performance of typical linear-algebra and machine-learning algorithms such as general matrix multiplication, Cholesky decomposition, K-means, and K-nearest neighbors. CAGTP is also easy to extend to multiple GPUs, achieves load balance across GPUs with widely differing performance, and efficiently solves mixed-task workloads and application problems with irregular parallelism.
Keywords: heterogeneous computing system; Compute Unified Device Architecture (CUDA); thread pool; task parallelism; task-multiplexed kernel function
Classification code: TP393 [Automation and Computer Technology — Computer Application Technology]