检索规则说明:AND代表“并且”;OR代表“或者”;NOT代表“不包含”;(注意必须大写,运算符两边需空一格)
检 索 范 例 :范例一: (K=图书馆学 OR K=情报学) AND A=范并思 范例二:J=计算机应用与软件 AND (U=C++ OR U=Basic) NOT M=Visual
作 者:赵新颖 袁峰 赵臻[1,3] 王保仓 ZHAO Xin-Ying;YUAN Feng;ZHAO Zhen;WANG Bao-Cang(State Key Laboratory of Integrated Services Networks,Xidian University,Xi’an 710071,China;The Institute 706 of Second Academy of China Aerospace Science and Industrial Corporation,Beijing 100854,China;Henan Key Laboratory of Network Cryptography Technology,Zhengzhou 453499,China)
机构地区:[1]西安电子科技大学空天地一体化综合业务网全国重点实验室,西安710071 [2]中国航天科工集团第二研究院706所,北京100854 [3]河南省网络密码技术重点实验室,郑州453499
出 处:《密码学报(中英文)》2024年第4期830-844,共15页Journal of Cryptologic Research
基 金:国家自然科学基金(62272362,U19B2021,62102299);河南省网络密码技术重点实验室研究课题(LNCT2022-A05);陕西高校青年创新团队。
摘 要:作为格密码算法的核心组件,环上多项式乘法的效率和准确性对于格密码方案的实用性和安全性至关重要.NTT及KNTT等现有的环上多项式乘法算法具有较高的并行性,其在CPU上运行时很难完全发挥优势.这也意味着,很多基于CPU实现的环上多项式乘法算法的效率仍有很大的提升空间.针对这一问题,本文基于Zhu等人提出的KNTT算法,利用GPU的众核特性以及强大的并行计算能力,实现了高效的环上多项式乘法运算.同时,将GPU线程模型中的线程块与KNTT算法中拆分出的小次数多项式一一对应,使得每个线程块负责一个多项式的NTT并行运算.由于GPU中的多个线程块可以被同时调度开始计算任务,因此多项式之间也可以实现并行处理,这进一步提高了KNTT算法在GPU上的实现效率.实验结果显示,GPU上实现的KNTT算法相较于NTT算法的GPU版本以及原始的CPU版本增速明显.在模多项式次数N为16384时,相对于原始C版本代码可以达到93.78%的增速.相较于GPU版本的NTT算法,在N=2048时,也可以达到40.62%的增速.As a core component of lattice-based cryptography,the efficiency and accuracy of polyno-mial multiplication over polynomial rings are crucial for the practicality and security of lattice-based cryptographic schemes.Existing polynomial multiplication algorithms,such as NTT and KNTT,have high parallelism,however they are difficult to fully leverage on CPUs.This means that there is still much room for improvement in the efficiency of ring polynomial multiplication algorithms based on CPU implementations.To address this issue,this paper proposes an efficient implementation of ring polynomial multiplication based on the KNTT algorithm proposed by Zhu et al.This implementation leverages the massive parallel nature and powerful computing capabilities of GPUs.Specifically,each thread block in the GPU thread model is mapped to one of the small-degree polynomials generated by the KNTT algorithm,with each block responsible for the NTT parallel operation of a single polyno-mial.As multiple thread blocks in the GPU can be scheduled to start computing tasks simultaneously,parallel processing of multiple polynomials is enabled,which further improves the efficiency of the KNTT algorithm on the GPU.Experimental results show that the GPU implementation of the KNTT algorithm is significantly faster than both the GPU version of the NTT algorithm and the original CPU version.For example,when the modular polynomial length N is 16384,the GPU implementation achieves a speedup of 93.78%compared to the original C version.Compared with the GPU version of the NTT algorithm,the GPU implementation of the KNTT algorithm can achieve a speedup of 40.62%when N=2048.
分 类 号:TP309.7[自动化与计算机技术—计算机系统结构]
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在载入数据...
正在链接到云南高校图书馆文献保障联盟下载...
云南高校图书馆联盟文献共享服务平台 版权所有©
您的IP:3.17.4.144