环上多项式乘法在GPU上的优化实现

Optimal Implementation of Multiplication over Polynomial Rings on GPUs

作　　者：赵新颖袁峰赵臻[1,3] 王保仓 ZHAO Xin-Ying;YUAN Feng;ZHAO Zhen;WANG Bao-Cang(State Key Laboratory of Integrated Services Networks,Xidian University,Xi’an 710071,China;The Institute 706 of Second Academy of China Aerospace Science and Industrial Corporation,Beijing 100854,China;Henan Key Laboratory of Network Cryptography Technology,Zhengzhou 453499,China)

机构地区：[1]西安电子科技大学空天地一体化综合业务网全国重点实验室,西安710071 [2]中国航天科工集团第二研究院706所,北京100854 [3]河南省网络密码技术重点实验室,郑州453499

出　　处：《密码学报（中英文）》2024年第4期830-844,共15页Journal of Cryptologic Research

基　　金：国家自然科学基金(62272362,U19B2021,62102299);河南省网络密码技术重点实验室研究课题(LNCT2022-A05);陕西高校青年创新团队。

摘　　要：作为格密码算法的核心组件,环上多项式乘法的效率和准确性对于格密码方案的实用性和安全性至关重要.NTT及KNTT等现有的环上多项式乘法算法具有较高的并行性,其在CPU上运行时很难完全发挥优势.这也意味着,很多基于CPU实现的环上多项式乘法算法的效率仍有很大的提升空间.针对这一问题,本文基于Zhu等人提出的KNTT算法,利用GPU的众核特性以及强大的并行计算能力,实现了高效的环上多项式乘法运算.同时,将GPU线程模型中的线程块与KNTT算法中拆分出的小次数多项式一一对应,使得每个线程块负责一个多项式的NTT并行运算.由于GPU中的多个线程块可以被同时调度开始计算任务,因此多项式之间也可以实现并行处理,这进一步提高了KNTT算法在GPU上的实现效率.实验结果显示,GPU上实现的KNTT算法相较于NTT算法的GPU版本以及原始的CPU版本增速明显.在模多项式次数N为16384时,相对于原始C版本代码可以达到93.78%的增速.相较于GPU版本的NTT算法,在N=2048时,也可以达到40.62%的增速.As a core component of lattice-based cryptography,the eﬀiciency and accuracy of polyno-mial multiplication over polynomial rings are crucial for the practicality and security of lattice-based cryptographic schemes.Existing polynomial multiplication algorithms,such as NTT and KNTT,have high parallelism,however they are diﬀicult to fully leverage on CPUs.This means that there is still much room for improvement in the eﬀiciency of ring polynomial multiplication algorithms based on CPU implementations.To address this issue,this paper proposes an eﬀicient implementation of ring polynomial multiplication based on the KNTT algorithm proposed by Zhu et al.This implementation leverages the massive parallel nature and powerful computing capabilities of GPUs.Specifically,each thread block in the GPU thread model is mapped to one of the small-degree polynomials generated by the KNTT algorithm,with each block responsible for the NTT parallel operation of a single polyno-mial.As multiple thread blocks in the GPU can be scheduled to start computing tasks simultaneously,parallel processing of multiple polynomials is enabled,which further improves the eﬀiciency of the KNTT algorithm on the GPU.Experimental results show that the GPU implementation of the KNTT algorithm is significantly faster than both the GPU version of the NTT algorithm and the original CPU version.For example,when the modular polynomial length N is 16384,the GPU implementation achieves a speedup of 93.78%compared to the original C version.Compared with the GPU version of the NTT algorithm,the GPU implementation of the KNTT algorithm can achieve a speedup of 40.62%when N=2048.

关键词：格密码多项式乘法 NTT KNTT

分类号：TP309.7[自动化与计算机技术—计算机系统结构]

参考文献：

正在载入数据...

二级参考文献：

正在载入数据...

耦合文献：

正在载入数据...

引证文献：

正在载入数据...

二级引证文献：

正在载入数据...

同被引文献：

正在载入数据...

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

环上多项式乘法在GPU上的优化实现

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

高级检索检索式检索

时间限定

期刊范围

学科限定全选

高级检索 检索式检索

时间限定

期刊范围

学科限定全选

环上多项式乘法在GPU上的优化实现

我的收藏

参考文献：

二级参考文献：

耦合文献：

引证文献：

二级引证文献：

同被引文献：

相关期刊文献：

相关的主题

相关的作者对象

相关的机构对象

下载全文

用户登录

高级检索检索式检索