TY - GEN
T1 - Controlled Kernel Launch for Dynamic Parallelism in GPUs
AU - Tang, Xulong
AU - Pattnaik, Ashutosh
AU - Jiang, Huaipan
AU - Kayiran, Onur
AU - Jog, Adwait
AU - Pai, Sreepathi
AU - Ibrahim, Mohamed
AU - Kandemir, Mahmut T.
AU - Das, Chita R.
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/5/5
Y1 - 2017/5/5
N2 - Dynamic parallelism (DP) is a promising feature for GPUs, which allows on-demand spawning of kernels on the GPU without any CPU intervention. However, this feature has two major drawbacks. First, the launching of GPU kernels can incur significant performance penalties. Second, dynamically-generated kernels are not always able to efficiently utilize the GPU cores due to hardware limits. To address these two concerns cohesively, we propose SPAWN, a runtime framework that controls the dynamically-generated kernels, thereby directly reducing the associated launch overheads and queuing latency. Moreover, it allows a better mix of dynamically-generated and original (parent) kernels for the scheduler to effectively hide the remaining overheads and improve the utilization of the GPU resources. Our results show that, across 13 benchmarks, SPAWN achieves 69% and 57% speedup over the flat (non-DP) implementation and baseline DP, respectively.
UR - http://www.scopus.com/inward/record.url?scp=85019582959&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85019582959&partnerID=8YFLogxK
U2 - 10.1109/HPCA.2017.14
DO - 10.1109/HPCA.2017.14
M3 - Conference contribution
AN - SCOPUS:85019582959
T3 - Proceedings - International Symposium on High-Performance Computer Architecture
SP - 649
EP - 660
BT - Proceedings - 2017 IEEE 23rd Symposium on High Performance Computer Architecture, HPCA 2017
PB - IEEE Computer Society
T2 - 23rd IEEE Symposium on High Performance Computer Architecture, HPCA 2017
Y2 - 4 February 2017 through 8 February 2017
ER -