TY - JOUR
T1 - Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms
AU - Madduri, Kamesh
AU - Im, Eun Jin
AU - Ibrahim, Khaled Z.
AU - Williams, Samuel
AU - Ethier, Stéphane
AU - Oliker, Leonid
N1 - Funding Information:
All authors from Lawrence Berkeley National Laboratory were supported by the DOE Office of Advanced Scientific Computing Research under Contract No. DE-AC02-05CH11231. Dr. Im was supported by Mid-career Researcher Program and by Basic Science Research Program through National Research Foundation of Korea (NRF) grant funded by the Ministry of Education, Science and Technology under Contract Nos. 2009-0083600 and 2010-0003044, and by research program 2010 of Kookmin University. Dr. Ethier was supported by the DOE Office of Fusion Energy Sciences under Contract No. DE-AC02-09CH11466. Additional support comes from Microsoft (Award #024263) and Intel (Award #024894) funding, and by matching funding by U.C. Discovery (Award #DIG07–10227). Further support comes from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, Samsung, and Sun Microsystems. We would like to express our gratitude to Intel and Sun for their hardware donations. Access to the Istanbul and GPU resources were made possible through the DOE/ASCR Computer Science Research Testbeds program and NERSC.
PY - 2011/9
Y1 - 2011/9
N2 - The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this work, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC's key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broad range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3-4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.
AB - The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this work, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC's key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broad range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3-4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.
UR - http://www.scopus.com/inward/record.url?scp=80052024564&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80052024564&partnerID=8YFLogxK
U2 - 10.1016/j.parco.2011.02.001
DO - 10.1016/j.parco.2011.02.001
M3 - Article
AN - SCOPUS:80052024564
SN - 0167-8191
VL - 37
SP - 501
EP - 520
JO - Parallel Computing
JF - Parallel Computing
IS - 9
ER -