Optimization of Dataflow Architecture for YOLO Neural Networks


Authors: MU Yu-Dong; LI Wen-Ming [1,2]; FAN Zhi-Hua; WU Meng [1,2]; WU Hai-Bin; AN Xue-Jun; YE Xiao-Chun; FAN Dong-Rui [1,2]

Affiliations: [1] State Key Lab of Processors (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190; [2] School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049

Source: Chinese Journal of Computers, 2025, No. 1, pp. 82-99 (18 pages)

Funding: Beijing Nova Program (20220484054, 20230484420); Beijing Natural Science Foundation, Changping Innovation Joint Fund (L234078); Youth Innovation Promotion Association of the Chinese Academy of Sciences.

Abstract: The YOLO (You Only Look Once) object detection algorithm stands out for its speed, accuracy, simple structure, and stable performance, and is therefore widely used in scenarios with strict real-time requirements, such as autonomous driving and vehicle detection. Traditional control-flow architectures executing YOLO neural networks suffer from low utilization of compute units, high power consumption, and poor energy efficiency. By contrast, the execution model of dataflow architectures, in which an operation fires as soon as its operands are ready, matches neural-network algorithms well and better exploits their data parallelism. However, deploying YOLO neural networks on a conventional dataflow architecture raises three problems: (1) the Data Flow Graph (DFG) mapping does not exploit the small convolution kernels that characterize YOLO's convolutional layers, so data reuse in convolution is too low, which further reduces compute-unit utilization; (2) operator scheduling does not exploit the tight structural coupling between operators, notably between convolution and activation layers, causing large amounts of redundant data transfer between the Processing Element (PE) array and on-chip memory; (3) data access and execution are tightly coupled and serialized, so data-access latency is excessive. To address these problems, this paper presents DFU-Y, a dataflow accelerator designed for YOLO neural networks. First, starting from the nested-loop execution pattern of convolution, we analyze the data-reuse characteristics of small-kernel convolutions and propose a DFG mapping algorithm that favors data reuse inside each execution unit, improving overall convolution efficiency. Second, to exploit data reuse between structurally coupled operators, DFU-Y introduces an operator-fusion scheduling mechanism at the DFG level that reduces memory accesses and improves network execution efficiency. Finally, DFU-Y decouples data access from execution through double buffering, so that transfers and computation proceed in parallel, hiding inter-program data-transfer latency and raising compute-unit utilization. Experiments show that, compared with a baseline dataflow architecture (DFU) and a GPU (NVIDIA Xavier NX), DFU-Y achieves speedups of 2.527x and 1.334x and energy-efficiency gains of 2.658x and 3.464x, respectively; compared with the YOLO-specific accelerator Arria-YOLO, DFU-Y reaches 72.97% of its performance and 87.41% of its energy efficiency while retaining good generality.
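The first problem concerns data reuse in small-kernel convolutions. As a rough illustration (not the paper's mapping algorithm; all names here are hypothetical), a direct 3x3, stride-1 convolution reads each interior input element once per output window it participates in, so a mapping that keeps a window's inputs local to the execution unit can cut memory reads by a factor approaching K*K:

```python
import numpy as np

def conv2d_direct(x, w):
    """Direct 2-D convolution, stride 1, no padding (toy helper)."""
    K = w.shape[0]
    out_h, out_w = x.shape[0] - K + 1, x.shape[1] - K + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # every element of the KxK window is read once per output pixel
            out[i, j] = np.sum(x[i:i + K, j:j + K] * w)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3))
y = conv2d_direct(x, w)

# Reads issued by the naive loop vs. unique input elements:
total_reads = y.size * w.size              # 4 outputs * 9 reads = 36
unique_elems = x.size                      # 16
reuse_factor = total_reads / unique_elems  # 2.25 here; tends to K*K = 9 for large inputs
```

The gap between `total_reads` and `unique_elems` is exactly the reuse a small-kernel-aware DFG mapping can keep on-chip instead of refetching.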
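For the second problem, the operator-fusion idea can be sketched in plain Python (a toy model, not DFU-Y's actual scheduler): fusing convolution with the activation that follows it applies ReLU while each result is still local to the PE, instead of writing the whole intermediate tensor to memory and reading it back:

```python
import numpy as np

def conv2d(x, w):
    """Direct stride-1 convolution, no padding (toy helper)."""
    K = w.shape[0]
    out = np.zeros((x.shape[0] - K + 1, x.shape[1] - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + K, j:j + K] * w)
    return out

def conv_relu_unfused(x, w):
    y = conv2d(x, w)          # intermediate tensor written out...
    return np.maximum(y, 0)   # ...then read back for the activation pass

def conv_relu_fused(x, w):
    """Apply ReLU while each output value is still in the PE."""
    K = w.shape[0]
    out = np.zeros((x.shape[0] - K + 1, x.shape[1] - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            acc = np.sum(x[i:i + K, j:j + K] * w)
            out[i, j] = acc if acc > 0 else 0.0  # no memory round trip
    return out
```

Both variants produce identical results; the fused one simply never materializes the pre-activation tensor, which is the data-access saving the abstract's DFG-level fusion mechanism targets.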
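The third contribution, decoupling data access from execution via double buffering, can be illustrated with a back-of-the-envelope timing model (the tile counts and cycle costs below are hypothetical, not measurements from the paper): with ping-pong buffers, the load of tile i+1 overlaps the computation of tile i, so only the first load and the last computation are exposed on the critical path:

```python
def serial_time(n_tiles, t_load, t_compute):
    """Single buffer: every tile's load stalls the compute units."""
    return n_tiles * (t_load + t_compute)

def double_buffered_time(n_tiles, t_load, t_compute):
    """Ping-pong buffers: load of tile i+1 overlaps compute of tile i."""
    steady = (n_tiles - 1) * max(t_load, t_compute)
    return t_load + steady + t_compute  # only first load and last compute exposed

# e.g. 8 tiles, 3 cycles to load a tile, 5 cycles to compute on it:
# serial:          8 * (3 + 5)     = 64 cycles
# double-buffered: 3 + 7 * 5 + 5   = 43 cycles
```

When compute time dominates load time, the load latency is almost fully hidden, which is how the double buffer raises compute-unit utilization.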

Keywords: YOLO algorithm; dataflow architecture; dataflow graph optimization; convolutional neural network; neural network acceleration

Classification: TP301 [Automation and Computer Technology / Computer System Architecture]

 
