TY - GEN
T1 - High-Performance Graph Analytics on Manycore Processors
AU - Slota, George M.
AU - Rajamanickam, Sivasankaran
AU - Madduri, Kamesh
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2015/7/17
Y1 - 2015/7/17
N2 - The divergence in the computer architecture landscape has resulted in different architectures being considered mainstream at the same time. For application and algorithm developers, a dilemma arises when one must focus on using underlying architectural features to extract the best performance on each of these architectures, while writing portable code at the same time. We focus on this problem with graph analytics as our target application domain. In this paper, we present an abstraction-based methodology for performance-portable graph algorithm design on manicure architectures. We demonstrate our approach by systematically optimizing algorithms for the problems of breadth-first search, color propagation, and strongly connected components. We use Kokkos, a manicure library and programming model, for prototyping our algorithms. Our portable implementation of the strongly connected components algorithm on the NVIDIA Tesla K40M is up to 3.25× faster than a state-of-the-art parallel CPU implementation on a dual-socket Sandy Bridge compute node.
AB - The divergence in the computer architecture landscape has resulted in different architectures being considered mainstream at the same time. For application and algorithm developers, a dilemma arises when one must focus on using underlying architectural features to extract the best performance on each of these architectures, while writing portable code at the same time. We focus on this problem with graph analytics as our target application domain. In this paper, we present an abstraction-based methodology for performance-portable graph algorithm design on manicure architectures. We demonstrate our approach by systematically optimizing algorithms for the problems of breadth-first search, color propagation, and strongly connected components. We use Kokkos, a manicure library and programming model, for prototyping our algorithms. Our portable implementation of the strongly connected components algorithm on the NVIDIA Tesla K40M is up to 3.25× faster than a state-of-the-art parallel CPU implementation on a dual-socket Sandy Bridge compute node.
UR - http://www.scopus.com/inward/record.url?scp=84971474108&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84971474108&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2015.54
DO - 10.1109/IPDPS.2015.54
M3 - Conference contribution
AN - SCOPUS:84971474108
T3 - Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015
SP - 17
EP - 27
BT - Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 29th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2015
Y2 - 25 May 2015 through 29 May 2015
ER -