TY - GEN
T1 - Gyrokinetic toroidal simulations on leading multi- and manycore HPC systems
AU - Madduri, Kamesh
AU - Ibrahim, Khaled Z.
AU - Williams, Samuel
AU - Im, Eun Jin
AU - Ethier, Stephane
AU - Shalf, John
AU - Oliker, Leonid
PY - 2011
Y1 - 2011
N2 - The gyrokinetic Particle-in-Cell (PIC) method is a critical computational tool enabling petascale fusion simulation research. In this work, we present novel multi- and manycore-centric optimizations to enhance performance of GTC, a PIC-based production code for studying plasma microturbulence in tokamak devices. Our optimizations encompass all six GTC subroutines and include multi-level particle and grid decompositions designed to improve multi-node parallel scaling, particle binning for improved load balance, GPU acceleration of key subroutines, and memory-centric optimizations to improve single-node scaling and reduce memory utilization. The new hybrid MPI-OpenMP and MPI-OpenMP-CUDA GTC versions achieve up to a 2× speedup over the production Fortran code on four parallel systems: clusters based on the AMD Magny-Cours, Intel Nehalem-EP, IBM BlueGene/P, and NVIDIA Fermi architectures. Finally, strong scaling experiments provide insight into parallel scalability, memory utilization, and programmability trade-offs for large-scale gyrokinetic PIC simulations, while attaining a 1.6× speedup on 49,152 XE6 cores.
UR - http://www.scopus.com/inward/record.url?scp=83155188965&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=83155188965&partnerID=8YFLogxK
U2 - 10.1145/2063384.2063415
DO - 10.1145/2063384.2063415
M3 - Conference contribution
AN - SCOPUS:83155188965
SN - 9781450307710
T3 - Proceedings of 2011 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
BT - Proceedings of 2011 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
T2 - 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC11
Y2 - 12 November 2011 through 18 November 2011
ER -