TY - GEN
T1 - PhiRSA
T2 - 23rd International Conference on Selected Areas in Cryptography, SAC 2016
AU - Zhao, Yuan
AU - Pan, Wuqiong
AU - Lin, Jingqiang
AU - Liu, Peng
AU - Xue, Cong
AU - Zheng, Fangyu
N1 - Funding Information:
Y. Zhao—This work was partially supported by National 973 Program under award No. 2014CB340603 and No. 2013CB338001, Strategy Pilot Project of Chinese Academy of Sciences under award No. XDA06010702.
Publisher Copyright:
© 2017, Springer International Publishing AG.
PY - 2017
Y1 - 2017
N2 - Efficient implementations of public-key cryptographic algorithms on general-purpose computing devices, facilitate the applications of cryptography in communication security. Existing solutions work in two different directions: implementations on GPUs achieve high throughput but great latency, while those on CPUs are with low throughput and small latency. Intel Xeon Phi is the first highly parallel coprocessor of Many Integrated Core (MIC) architecture, with up, to 61 cores and one 512-bit Vector Processing Unit (VPU) in each core, which offers the potential to achieve both high throughput and small latency. In this paper, we propose a vector-oriented Montgomery multiplication design based on vector carry propagation chain (VCPC) method to fully exploit the computing power of vector instructions on Intel Xeon Phi. Two key features of our design sharply reduce the number of instructions: (1) organizing the additions in Montgomery multiplication to be four VCPCs for saving the overhead of handling carry bits; (2) computing the intermediate scalar variable q in every round without breaking the flow of VCPCs. Furthermore, we offer the optimal Montgomery multiplication implementation of our design on Intel Xeon Phi, which make VPUs fully pipelined and maintain carry bits in vector mask registers. Based on the above, we implement RSA named PhiRSA and evaluate it on Intel Xeon Phi 7120P. For 1024, 2048 and 4096-bit RSA, PhiRSA performs 258,370, 41,803 and 5,358 decryptions per second, and the latencies are 0.94, 5.84 and 45.54, ms, respectively. These results achieve 4.1 to 8.5 times performance of the existing RSA implementations on Intel Xeon Phi, exhibit high throughput comparable to those on GPUs but with much less parallel tasks, and small latency comparable to those on CPUs.
AB - Efficient implementations of public-key cryptographic algorithms on general-purpose computing devices, facilitate the applications of cryptography in communication security. Existing solutions work in two different directions: implementations on GPUs achieve high throughput but great latency, while those on CPUs are with low throughput and small latency. Intel Xeon Phi is the first highly parallel coprocessor of Many Integrated Core (MIC) architecture, with up, to 61 cores and one 512-bit Vector Processing Unit (VPU) in each core, which offers the potential to achieve both high throughput and small latency. In this paper, we propose a vector-oriented Montgomery multiplication design based on vector carry propagation chain (VCPC) method to fully exploit the computing power of vector instructions on Intel Xeon Phi. Two key features of our design sharply reduce the number of instructions: (1) organizing the additions in Montgomery multiplication to be four VCPCs for saving the overhead of handling carry bits; (2) computing the intermediate scalar variable q in every round without breaking the flow of VCPCs. Furthermore, we offer the optimal Montgomery multiplication implementation of our design on Intel Xeon Phi, which make VPUs fully pipelined and maintain carry bits in vector mask registers. Based on the above, we implement RSA named PhiRSA and evaluate it on Intel Xeon Phi 7120P. For 1024, 2048 and 4096-bit RSA, PhiRSA performs 258,370, 41,803 and 5,358 decryptions per second, and the latencies are 0.94, 5.84 and 45.54, ms, respectively. These results achieve 4.1 to 8.5 times performance of the existing RSA implementations on Intel Xeon Phi, exhibit high throughput comparable to those on GPUs but with much less parallel tasks, and small latency comparable to those on CPUs.
UR - http://www.scopus.com/inward/record.url?scp=85032654555&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85032654555&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-69453-5_26
DO - 10.1007/978-3-319-69453-5_26
M3 - Conference contribution
AN - SCOPUS:85032654555
SN - 9783319694528
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 482
EP - 500
BT - Selected Areas in Cryptography – SAC 2016 - 23rd International Conference, Revised Selected Papers
A2 - Avanzi, Roberto
A2 - Heys, Howard
PB - Springer Verlag
Y2 - 10 August 2016 through 12 August 2016
ER -