检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:郭金鑫 张广婷 张云泉[2] 陈泽华 贾海鹏[2] GUO Jinxin;ZHANG Guangting;ZHANG Yunquan;CHEN Zehua;JIA Haipeng(College of Data Science,Taiyuan University of Technology,Taiyuan 030024,China;State Key Laboratory of Computer Architecture,Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China)
机构地区:[1]太原理工大学大数据学院,太原030024 [2]中国科学院计算技术研究所计算机体系结构国家重点实验室,北京100190
出 处:《计算机科学与探索》2022年第6期1304-1315,共12页Journal of Frontiers of Computer Science and Technology
基 金:国家重点研发计划(2017YFB0202105,2016YFB0200803,2017YFB0202302);国家自然科学基金(61972376);北京市自然科学基金(L182053)。
摘 要:快速傅里叶变换(FFT)算法是处理器基础软件生态的重要组成部分,在工程、科学、物理和数学等领域的应用十分广泛,且这些领域对FFT算法的性能也提出了越来越高的要求。研究FFT算法在ARMv8和X86-64上的高性能实现特别是大基高性能的实现,提高FFT算法的计算性能日益重要。针对ARMv8和X86-64计算平台的架构特征,研究FFT算法的高性能实现和优化方法。通过蝶形网络优化、大基网络级数降低、大基蝶形计算优化、SIMD汇编优化以及寄存器使用策略优化等方法的应用,有效提升了FFT算法的性能,特别是提升了FFT大基的计算性能,解决了寄存器不够用的性能瓶颈,并最终总结了一套Cooley-Tukey FFT算法的高性能实现策略和优化方案。实验结果表明,在ARM、X86-64处理器上,实现的FFT算法,较ARMPL、Intel MKL和FFTW性能有明显提升,较中小基性能也有明显提升。The fast Fourier transform(FFT)algorithm is considered as an important element of the processor’s basic software ecology,and it is widely applied in the field of engineering,science,physics and mathematics.Meanwhile,the requirements for the performance of FFT in these applications are also continuously rising.Therefore,it is of definite significance to study the high-performance implementation of FFT algorithm,especially the high-performance implementation of large radices of FFT in ARMv8 and X86-64,and to improve the calculation performance of FFT algorithm.In view of the architectural features of the ARMv8 and X86-64 computing platforms,this paper studies the high-performance implementation and optimization methods of the FFT algorithm.Through the application of butterfly network optimization,large radices network stages decrease,large radices butterfly computation optimization,SIMD(single instruction multiple data)assembly optimization,and register usage optimization methods,this paper effectively improves the performance of the FFT algorithm,considerably improves the calculation performance of the large radices of FFT,and solves the performance bottlenecks of insufficiency of register resources.Lastly,the summary of a set of Cooley-Tukey FFT algorithm high-performance implementation strategies and optimization solutions is made.The experimental results indicate that for the ARM and X86-64 processors,the FFT algorithm implemented can achieve a significant improvement in performance compared with ARMPL(ARM performance library),Intel MKL(math kernel library)and FFTW(fastest Fourier transform in the West)and can achieve a significant improvement in performance compared with small and medium radices.
关 键 词:快速傅里叶变换(FFT) ARMv8 X86-64 FFTW SIMD优化
分 类 号:TP311[自动化与计算机技术—计算机软件与理论]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:18.118.253.134