Realization and Optimization of Double-precision Matrix Vector Multiplication Based on HXDSP

Authors: LIAO Xiao-qun; WANG Jia-yi; SU Tao [2]; LI Min; ZHANG Mei-chun

Affiliations: [1] School of Communication and Information Engineering, Xi'an University of Science and Technology, Xi'an 710054, Shaanxi, China; [2] National Lab of Radar Signal Processing, Xidian University, Xi'an 710071, Shaanxi, China

Source: Computer Technology and Development (《计算机技术与发展》), 2021, No. 11, pp. 101-107 (7 pages)

Funding: National Science and Technology Major Project (2012ZX01034001-001).

Abstract: The programming model of the HXDSP1042 compiler now supports byte-granular addressing as well as loading, storing, and operating on 64-bit data, which is significant for improving the precision of floating-point computation. Matrix algorithms are common operations in radar signal processing, and matrix operations account for a large share of the work in adaptive beamforming and direction estimation. Many current DSP processors cannot automatically make full use of their own hardware architecture, so getting the compiler to handle matrix operations efficiently is particularly important. HXDSP1042 is a processor for digital signal processing and embedded applications; designing matrix operations around the chip's hardware characteristics within the HXDSP1042 instruction framework is an important step toward high-performance applications of the chip. Building on the characteristics of the multi-cluster VLIW instruction architecture, this paper applies parallel optimization techniques (loop unrolling, instruction scheduling, and software pipelining) to make full use of the chip's internal hardware resources and optimize the double-precision floating-point matrix-vector multiplication function on the HXDSP1042. Experimental results show that, relative to the serial algorithm structure before optimization, the optimized function reaches a speedup ratio of 11 or more.
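The loop-unrolling idea named in the abstract can be illustrated with a minimal sketch in portable C. This is not the authors' HXDSP1042 implementation: it uses no HXDSP intrinsics or assembly, and the function names `dgemv_serial` and `dgemv_unrolled4`, the row-major layout, and the unroll factor of 4 are all assumptions chosen for illustration. The point is only that several independent accumulators remove the serial dependency on a single sum, which gives a VLIW scheduler or software pipeliner more independent multiply-add operations to pack per cycle.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: y = A * x, with A stored row-major as rows x cols doubles. */
void dgemv_serial(size_t rows, size_t cols,
                  const double *a, const double *x, double *y)
{
    assert(a && x && y);
    for (size_t i = 0; i < rows; ++i) {
        double acc = 0.0;
        for (size_t j = 0; j < cols; ++j)
            acc += a[i * cols + j] * x[j];
        y[i] = acc;
    }
}

/* Illustrative variant: inner loop unrolled by 4 (assumed factor) with
 * four independent accumulators, so the multiply-adds of one iteration
 * do not depend on each other. */
void dgemv_unrolled4(size_t rows, size_t cols,
                     const double *a, const double *x, double *y)
{
    assert(a && x && y);
    for (size_t i = 0; i < rows; ++i) {
        const double *row = a + i * cols;
        double acc0 = 0.0, acc1 = 0.0, acc2 = 0.0, acc3 = 0.0;
        size_t j = 0;

        /* Main unrolled body: handles groups of four columns. */
        for (; j + 4 <= cols; j += 4) {
            acc0 += row[j]     * x[j];
            acc1 += row[j + 1] * x[j + 1];
            acc2 += row[j + 2] * x[j + 2];
            acc3 += row[j + 3] * x[j + 3];
        }

        /* Combine the partial sums, then handle leftover columns. */
        double acc = (acc0 + acc1) + (acc2 + acc3);
        for (; j < cols; ++j)
            acc += row[j] * x[j];

        y[i] = acc;
    }
}
```

Because the unrolled version adds the products in a different order, its result can differ from the serial one in the last bits for general floating-point inputs; any comparison between the two should allow for that.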

Keywords: multi-cluster; single instruction multiple data (SIMD); 64-bit data operations; software pipelining; digital signal processor

Classification number: TP301 [Automation and Computer Technology: Computer System Architecture]

 
