TY - JOUR
T1 - μC-States: Fine-Grained GPU Datapath Power Management
T2 - 25th International Conference on Parallel Architectures and Compilation Techniques, PACT 2016
AU - Kayiran, Onur
AU - Jog, Adwait
AU - Pattnaik, Ashutosh
AU - Ausavarungnirun, Rachata
AU - Tang, Xulong
AU - Kandemir, Mahmut T.
AU - Loh, Gabriel H.
AU - Mutlu, Onur
AU - Das, Chita R.
N1 - Funding Information:
Acknowledgments: The authors would like to thank the reviewers for their feedback. This research is supported in part by NSF grants #1205618, #1213052, #1212962, #1302225, #1302557, #1317560, #1320478, #1320531, #1409095, #1409723, #1439021, #1439057, and #1526750. Adwait Jog acknowledges the start-up grant from the College of William and Mary. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
Publisher Copyright:
© 2016 ACM.
PY - 2016
Y1 - 2016
N2 - To improve the performance of Graphics Processing Units (GPUs) beyond simply increasing core count, architects are recently adopting a scale-up approach: the peak throughput and individual capabilities of the GPU cores are increasing rapidly. This big-core trend in GPUs leads to various challenges, including higher static power consumption and lower and imbalanced utilization of the datapath components of a big core. As we show in this paper, two key problems ensue: (1) the lower and imbalanced datapath utilization can waste power as an application does not always utilize all portions of the big core datapath, and (2) the use of big cores can lead to application performance degradation in some cases due to the higher memory system contention caused by the more memory requests generated by each big core. This paper introduces a new analysis of datapath component utilization in big-core GPUs based on queuing theory principles. Building on this analysis, we introduce a fine-grained dynamic power- and clock-gating mechanism for the entire datapath, called μC-States, which aims to minimize power consumption by turning off or tuning down datapath components that are not bottlenecks for the performance of the running application. Our experimental evaluation demonstrates that μC-States significantly reduces both static and dynamic power consumption in a big-core GPU, while also significantly improving the performance of applications affected by high memory system contention. We also show that our analysis of datapath component utilization can guide scheduling and design decisions in a GPU architecture that contains heterogeneous cores.
UR - http://www.scopus.com/inward/record.url?scp=84989291136&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84989291136&partnerID=8YFLogxK
U2 - 10.1145/2967938.2967941
DO - 10.1145/2967938.2967941
M3 - Conference article
AN - SCOPUS:84989291136
SN - 1089-795X
SP - 17
EP - 30
JO - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
JF - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
Y2 - 11 September 2016 through 15 September 2016
ER -