Cooley-Tukey FFT算法高性能实现与优化研究  被引量:4

High-Performance Implementation and Optimization of Cooley-Tukey FFT Algorithm

在线阅读下载全文

作  者:郭金鑫 张广婷 张云泉[2] 陈泽华 贾海鹏[2] GUO Jinxin;ZHANG Guangting;ZHANG Yunquan;CHEN Zehua;JIA Haipeng(College of Data Science,Taiyuan University of Technology,Taiyuan 030024,China;State Key Laboratory of Computer Architecture,Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China)

机构地区:[1]太原理工大学大数据学院,太原030024 [2]中国科学院计算技术研究所计算机体系结构国家重点实验室,北京100190

出  处:《计算机科学与探索》2022年第6期1304-1315,共12页Journal of Frontiers of Computer Science and Technology

基  金:国家重点研发计划(2017YFB0202105,2016YFB0200803,2017YFB0202302);国家自然科学基金(61972376);北京市自然科学基金(L182053)。

摘  要:快速傅里叶变换(FFT)算法是处理器基础软件生态的重要组成部分,在工程、科学、物理和数学等领域的应用十分广泛,且这些领域对FFT算法的性能也提出了越来越高的要求。研究FFT算法在ARMv8和X86-64上的高性能实现特别是大基高性能的实现,提高FFT算法的计算性能日益重要。针对ARMv8和X86-64计算平台的架构特征,研究FFT算法的高性能实现和优化方法。通过蝶形网络优化、大基网络级数降低、大基蝶形计算优化、SIMD汇编优化以及寄存器使用策略优化等方法的应用,有效提升了FFT算法的性能,特别是提升了FFT大基的计算性能,解决了寄存器不够用的性能瓶颈,并最终总结了一套Cooley-Tukey FFT算法的高性能实现策略和优化方案。实验结果表明,在ARM、X86-64处理器上,实现的FFT算法,较ARMPL、Intel MKL和FFTW性能有明显提升,较中小基性能也有明显提升。The fast Fourier transform(FFT)algorithm is considered as an important element of the processor’s basic software ecology,and it is widely applied in the field of engineering,science,physics and mathematics.Meanwhile,the requirements for the performance of FFT in these applications are also continuously rising.Therefore,it is of definite significance to study the high-performance implementation of FFT algorithm,especially the high-performance implementation of large radices of FFT in ARMv8 and X86-64,and to improve the calculation performance of FFT algorithm.In view of the architectural features of the ARMv8 and X86-64 computing platforms,this paper studies the high-performance implementation and optimization methods of the FFT algorithm.Through the application of butterfly network optimization,large radices network stages decrease,large radices butterfly computation optimization,SIMD(single instruction multiple data)assembly optimization,and register usage optimization methods,this paper effectively improves the performance of the FFT algorithm,considerably improves the calculation performance of the large radices of FFT,and solves the performance bottlenecks of insufficiency of register resources.Lastly,the summary of a set of Cooley-Tukey FFT algorithm high-performance implementation strategies and optimization solutions is made.The experimental results indicate that for the ARM and X86-64 processors,the FFT algorithm implemented can achieve a significant improvement in performance compared with ARMPL(ARM performance library),Intel MKL(math kernel library)and FFTW(fastest Fourier transform in the West)and can achieve a significant improvement in performance compared with small and medium radices.

关 键 词:快速傅里叶变换(FFT) ARMv8 X86-64 FFTW SIMD优化 

分 类 号:TP311[自动化与计算机技术—计算机软件与理论]

 

参考文献:

正在载入数据...

 

二级参考文献:

正在载入数据...

 

耦合文献:

正在载入数据...

 

引证文献:

正在载入数据...

 

二级引证文献:

正在载入数据...

 

同被引文献:

正在载入数据...

 

相关期刊文献:

正在载入数据...

相关的主题
相关的作者对象
相关的机构对象