TY - GEN
T1 - A CPU-GPU hybrid implementation and model-driven scheduling of the fast multipole method
AU - Choi, Jee
AU - Chandramowlishwaran, Aparna
AU - Madduri, Kamesh
AU - Vuduc, Richard
PY - 2014
Y1 - 2014
N2 - This paper presents an optimized CPU-GPU hybrid implementation and a GPU performance model for the kernel-independent fast multipole method (FMM). We implement an optimized kernel-independent FMM for GPUs, and combine it with our previous CPU implementation to create a hybrid CPU+GPU FMM kernel. When compared to another highly optimized GPU implementation, our implementation achieves as much as a 1.9× speedup. We then extend our previous lower bound analyses of FMM for CPUs to include GPUs. This yields a model for predicting the execution times of the different phases of FMM. Using this information, we estimate the execution times of a set of static hybrid schedules on a given system, which allows us to automatically choose the schedule that yields the best performance. In the best case, we achieve a speedup of 1.5× compared to our GPU-only implementation, despite the large difference in computational powers of CPUs and GPUs. We comment on one consequence of having such performance models, which is to enable speculative predictions about FMM scalability on future systems.
AB - This paper presents an optimized CPU-GPU hybrid implementation and a GPU performance model for the kernel-independent fast multipole method (FMM). We implement an optimized kernel-independent FMM for GPUs, and combine it with our previous CPU implementation to create a hybrid CPU+GPU FMM kernel. When compared to another highly optimized GPU implementation, our implementation achieves as much as a 1.9× speedup. We then extend our previous lower bound analyses of FMM for CPUs to include GPUs. This yields a model for predicting the execution times of the different phases of FMM. Using this information, we estimate the execution times of a set of static hybrid schedules on a given system, which allows us to automatically choose the schedule that yields the best performance. In the best case, we achieve a speedup of 1.5× compared to our GPU-only implementation, despite the large difference in computational powers of CPUs and GPUs. We comment on one consequence of having such performance models, which is to enable speculative predictions about FMM scalability on future systems.
UR - http://www.scopus.com/inward/record.url?scp=84898787906&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84898787906&partnerID=8YFLogxK
U2 - 10.1145/2576779.2576787
DO - 10.1145/2576779.2576787
M3 - Conference contribution
AN - SCOPUS:84898787906
SN - 9781450327664
T3 - ACM International Conference Proceeding Series
SP - 64
EP - 71
BT - Proceedings of the 7th Workshop on General Purpose Processing Using Graphics Processing Units, GPGPU 2014
PB - Association for Computing Machinery
T2 - 7th Workshop on General Purpose Processing Using Graphics Processing Units, GPGPU 2014
Y2 - 1 March 2014 through 1 March 2014
ER -