Parallel optimization of large-point FFT on Sunway 26010

Cited by: 1

Authors: GUO Jun; LIU Peng [3]; YANG Xinyao; ZHANG Lufei; WU Dong

Affiliations: [1] School of Information Engineering and Internet of Things, Huzhou Vocational and Technical College, Huzhou 313000, Zhejiang, China; [2] Huzhou Key Laboratory of IoT Intelligent System Integration Technology, Huzhou Vocational and Technical College, Huzhou 313000, Zhejiang, China; [3] College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, Zhejiang, China; [4] Ant Group Limited Company, Hangzhou 310013, Zhejiang, China; [5] State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214125, Jiangsu, China

Source: Journal of Zhejiang University (Engineering Science), 2024, No. 1, pp. 78-86 (9 pages)

Funding: Supported by the Open Fund of the State Key Laboratory of Mathematical Engineering and Advanced Computing (2019A10).

Abstract: A many-core parallel optimization scheme for large-point FFT is proposed, based on the architectural characteristics and programming specifications of the domestic Sunway 26010 processor used in the Sunway TaihuLight supercomputer. The scheme is derived from the classic Cooley-Tukey FFT algorithm and achieves parallel acceleration by iteratively decomposing one-dimensional large-point data into two-dimensional small-scale matrices. To solve the problems of reading, writing, transposing, and computing the "column FFT" of the matrix, a "column-sharing, row-continuity" read/write strategy (columns divided evenly, rows kept contiguous) is proposed. Through reasonable allocation, rearrangement, and exchange of data, combined with optimizations such as SIMD vectorization, twiddle factor optimization, double buffering, register communication, and strided transfer, the computing resources and transfer bandwidth of the many-core processor are fully utilized. Experimental results show that a parallel program running on the 64 slave cores of a single core group achieves a maximum speedup of 65x and an average speedup of more than 48x over the FFTW library running on the main core.
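The decomposition named in the abstract, which views a one-dimensional input as a two-dimensional matrix and transforms its columns and rows separately, can be illustrated with a single generic Cooley-Tukey step. The following is only a minimal sequential sketch under stated assumptions: the function names `dft` and `four_step_fft` are hypothetical, a naive DFT stands in for the small-size kernel, and none of the Sunway-specific optimizations (SIMD, double buffering, register communication, strided transfer) are modeled. It is not the authors' implementation.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT; stands in for a small-size FFT kernel (assumption)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def four_step_fft(x, n1, n2):
    """One Cooley-Tukey step on a length n1*n2 input viewed as an
    n1 x n2 row-major matrix, so element (r, c) is x[r * n2 + c]."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")

    # Step 1: "column FFT" of length n1 on each of the n2 columns.
    # cols[c][k1] holds the k1-th output of column c.
    cols = [dft([x[r * n2 + c] for r in range(n1)]) for c in range(n2)]

    out = [0j] * n
    for k1 in range(n1):
        # Step 2: multiply by the twiddle factors W_N^(c * k1).
        row = [
            cols[c][k1] * cmath.exp(-2j * cmath.pi * c * k1 / n)
            for c in range(n2)
        ]
        # Step 3: "row FFT" of length n2 on the twiddled row.
        row_hat = dft(row)
        # Step 4: transposed write-out, X[k1 + n1 * k2] = row_hat[k2].
        for k2 in range(n2):
            out[k1 + n1 * k2] = row_hat[k2]
    return out
```

In this sketch the column transforms in step 1 are independent of one another, as are the row transforms in step 3, which is the property a many-core scheme can exploit. Applying the same step recursively to the length-n1 and length-n2 transforms gives the iterative decomposition the abstract describes.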

Keywords: Sunway TaihuLight; Sunway 26010; fast Fourier transform (FFT); Cooley-Tukey algorithm; many-core parallelism

Classification: TP338 [Automation and Computer Technology: Computer System Architecture]
