TY - JOUR
T1 - Optimization of parallel particle-to-grid interpolation on leading multicore platforms
AU - Madduri, Kamesh
AU - Su, Jimmy
AU - Williams, Samuel
AU - Oliker, Leonid
AU - Ethier, Stéphane
AU - Yelick, Katherine
N1 - Funding Information: All authors from Lawrence Berkeley National Laboratory were supported by the DOE Office of Advanced Scientific Computing Research under contract number DE-AC02-05CH11231. Dr. Ethier was supported by the DOE Office of Fusion Energy Sciences under contract number DE-AC02-09CH11466. Additional support comes from Microsoft (Award #024263), Intel (Award #024894), U.C. Discovery (Award #DIG07-10227), as well as Par Lab affiliates, including National Instruments, NEC, Nokia, NVIDIA, Samsung, and Sun Microsystems.
PY - 2012
Y1 - 2012
N2 - We are now in the multicore revolution which is witnessing a rapid evolution of architectural designs due to power constraints and correspondingly limited microprocessor clock speeds. Understanding how to efficiently utilize these systems in the context of demanding numerical algorithms is an urgent challenge to meet the ever growing computational needs of high-end computing. In this work, we examine multicore parallel optimization of the particle-to-grid interpolation step in particle-mesh methods, an inherently complex optimization problem due to its low computation intensity, irregular data accesses, and potential fine-grained data hazards. Our evaluated kernels are derived from two important numerical computations: a biological simulation of the heart using the Immersed Boundary (IB) method, and a Gyrokinetic Particle-in-Cell (PIC)-based application for studying fusion plasma microturbulence. We develop several novel synchronization and grid decomposition schemes, as well as low-level optimization techniques to maximize performance on three modern multicore platforms: Intel's Xeon X5550 (Nehalem), AMD's Opteron 2356 (Barcelona), and Sun's UltraSparc T2+ (Niagara). Results show that our optimizations lead to significant performance improvements, achieving up to a 5.6× speedup compared to the reference parallel implementation. Our work also provides valuable insight into the design of future autotuning frameworks for particle-to-grid interpolation on next-generation systems.
AB - We are now in the multicore revolution which is witnessing a rapid evolution of architectural designs due to power constraints and correspondingly limited microprocessor clock speeds. Understanding how to efficiently utilize these systems in the context of demanding numerical algorithms is an urgent challenge to meet the ever growing computational needs of high-end computing. In this work, we examine multicore parallel optimization of the particle-to-grid interpolation step in particle-mesh methods, an inherently complex optimization problem due to its low computation intensity, irregular data accesses, and potential fine-grained data hazards. Our evaluated kernels are derived from two important numerical computations: a biological simulation of the heart using the Immersed Boundary (IB) method, and a Gyrokinetic Particle-in-Cell (PIC)-based application for studying fusion plasma microturbulence. We develop several novel synchronization and grid decomposition schemes, as well as low-level optimization techniques to maximize performance on three modern multicore platforms: Intel's Xeon X5550 (Nehalem), AMD's Opteron 2356 (Barcelona), and Sun's UltraSparc T2+ (Niagara). Results show that our optimizations lead to significant performance improvements, achieving up to a 5.6× speedup compared to the reference parallel implementation. Our work also provides valuable insight into the design of future autotuning frameworks for particle-to-grid interpolation on next-generation systems.
UR - http://www.scopus.com/inward/record.url?scp=84865699225&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84865699225&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2012.28
DO - 10.1109/TPDS.2012.28
M3 - Article
AN - SCOPUS:84865699225
SN - 1045-9219
VL - 23
SP - 1915
EP - 1922
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 10
M1 - 6133280
ER -