Authors: 郭俊 (GUO Jun); 刘鹏 (LIU Peng)[3]; 杨昕遥 (YANG Xinyao); 张鲁飞 (ZHANG Lufei); 吴东 (WU Dong)
Affiliations: [1] School of Information Engineering and Internet of Things, Huzhou Vocational and Technical College, Huzhou 313000, Zhejiang, China; [2] Huzhou Key Laboratory of IoT Intelligent System Integration Technology, Huzhou Vocational and Technical College, Huzhou 313000, Zhejiang, China; [3] College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, Zhejiang, China; [4] Ant Group Limited Company, Hangzhou 310013, Zhejiang, China; [5] State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi 214125, Jiangsu, China
Source: Journal of Zhejiang University: Engineering Science (《浙江大学学报(工学版)》), 2024, No. 1, pp. 78-86 (9 pages)
Funding: Open Fund of the State Key Laboratory of Mathematical Engineering and Advanced Computing (2019A10).
Abstract: A many-core parallel optimization scheme for large-point FFT is proposed, based on the architecture and programming specifications of the domestic Sunway 26010 (申威26010) processor used in the Sunway TaihuLight (神威·太湖之光) supercomputer. The scheme is derived from the classic Cooley-Tukey FFT algorithm and achieves parallel acceleration by iteratively decomposing one-dimensional large-point data into two-dimensional small-scale matrices. To solve the reading, writing, transposition, and computation problems of the matrix "column FFT", a "column-sharing, row-continuity" read/write strategy is proposed. Through reasonable allocation, rearrangement, and exchange of data, combined with optimizations such as SIMD vectorization, twiddle factor optimization, double buffering, register communication, and strided transfer, the scheme makes full use of the computing resources and transfer bandwidth of the many-core processor. Experimental results show that a parallel program running on the 64 slave cores of a single core group achieves a maximum speedup of 65x and an average speedup of more than 48x over the FFTW library running on the main core.
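The decomposition the abstract describes, in which a one-dimensional large-point FFT is recast as small FFTs over the columns and rows of a two-dimensional matrix, can be illustrated with a minimal single-threaded sketch. This is not the authors' implementation: the function names `dft` and `four_step_fft` and the sizes `n1` and `n2` are hypothetical. A naive DFT stands in for the small-size kernel, and none of the Sunway 26010 optimizations (SIMD, double buffering, register communication, strided transfer) are modeled.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, used here as the small-size kernel and as a reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def four_step_fft(x, n1, n2):
    """Cooley-Tukey decomposition of a length n1*n2 DFT via an n1 x n2 matrix.

    The input is viewed as a row-major matrix a[r][c] = x[r * n2 + c].
    """
    n = n1 * n2
    assert len(x) == n

    # Step 1: "column FFT" of length n1 over every column c.
    # cols[c][k1] holds the transform of column c at output index k1.
    cols = [dft([x[r * n2 + c] for r in range(n1)]) for c in range(n2)]

    out = [0j] * n
    for k1 in range(n1):
        # Step 2: multiply by the twiddle factor W_n^(c * k1).
        row = [
            cols[c][k1] * cmath.exp(-2j * cmath.pi * c * k1 / n)
            for c in range(n2)
        ]
        # Step 3: "row FFT" of length n2.
        row_fft = dft(row)
        # Step 4: transposed write-back, X[k1 + n1 * k2] = C[k1][k2].
        for k2 in range(n2):
            out[k1 + n1 * k2] = row_fft[k2]
    return out
```

In this sketch the column transforms are independent of one another, as are the row transforms, which is the property a many-core scheme can exploit by assigning groups of columns or rows to different cores. The column reads in step 1 and the transposed write-back in step 4 are strided accesses, which corresponds to the "column FFT" read/write and transpose problem that the abstract says the "column-sharing, row-continuity" strategy addresses.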
Keywords: Sunway TaihuLight (神威·太湖之光); Sunway 26010 (申威26010); fast Fourier transform; Cooley-Tukey algorithm; many-core parallelism
Classification code: TP338 [Automation and Computer Technology: Computer System Architecture]