TY - GEN
T1 - Stash
T2 - 43rd IEEE International Conference on Distributed Computing Systems, ICDCS 2023
AU - Sharma, Aakash
AU - Bhasi, Vivek M.
AU - Singh, Sonali
AU - Jain, Rishabh
AU - Gunasekaran, Jashwant Raj
AU - Mitra, Subrata
AU - Kandemir, Mahmut Taylan
AU - Kesidis, George
AU - Das, Chita R.
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Deep neural networks (DNNs) are increasingly popular owing to their ability to solve complex problems such as image recognition, autonomous driving, and natural language processing. Their growing complexity, coupled with the use of larger volumes of training data (to achieve acceptable accuracy), has warranted the use of GPUs and other accelerators. Such accelerators are typically expensive, requiring users to pay a high upfront acquisition cost. For infrequent use, users can instead leverage the public cloud to mitigate this cost. However, given the wide diversity of hardware instances (particularly GPU instances) available in the public cloud, it is challenging for a user to make an appropriate choice from a cost/performance standpoint. In this work, we address this problem by (i) introducing Stash, a comprehensive distributed deep learning (DDL) profiler that determines the various execution stalls DDL suffers from, and (ii) using Stash to extensively characterize various public cloud GPU instances by running popular DNN models on them. Specifically, Stash estimates two types of communication stalls, namely interconnect and network stalls, which play a dominant role in DDL execution time. Stash is implemented on top of prior work, DS-Analyzer, which computes only CPU and disk stalls. Using our detailed stall characterization, we list the advantages and shortcomings of public cloud GPU instances to help users make informed decisions. Our characterization results indicate that the more expensive GPU instances may not be the most performant for all DNN models, and that AWS can sometimes sub-optimally allocate hardware interconnect resources. Specifically, the intra-machine interconnect can introduce communication overheads of up to 90% of DNN training time, and network-connected instances can suffer up to a 5× slowdown compared to training on a single instance. Furthermore, (iii) we model the impact of macroscopic DNN features, such as the number of layers and the number of gradients, on communication stalls, and finally, (iv) we briefly discuss a cost comparison with existing work.
AB - Deep neural networks (DNNs) are increasingly popular owing to their ability to solve complex problems such as image recognition, autonomous driving, and natural language processing. Their growing complexity, coupled with the use of larger volumes of training data (to achieve acceptable accuracy), has warranted the use of GPUs and other accelerators. Such accelerators are typically expensive, requiring users to pay a high upfront acquisition cost. For infrequent use, users can instead leverage the public cloud to mitigate this cost. However, given the wide diversity of hardware instances (particularly GPU instances) available in the public cloud, it is challenging for a user to make an appropriate choice from a cost/performance standpoint. In this work, we address this problem by (i) introducing Stash, a comprehensive distributed deep learning (DDL) profiler that determines the various execution stalls DDL suffers from, and (ii) using Stash to extensively characterize various public cloud GPU instances by running popular DNN models on them. Specifically, Stash estimates two types of communication stalls, namely interconnect and network stalls, which play a dominant role in DDL execution time. Stash is implemented on top of prior work, DS-Analyzer, which computes only CPU and disk stalls. Using our detailed stall characterization, we list the advantages and shortcomings of public cloud GPU instances to help users make informed decisions. Our characterization results indicate that the more expensive GPU instances may not be the most performant for all DNN models, and that AWS can sometimes sub-optimally allocate hardware interconnect resources. Specifically, the intra-machine interconnect can introduce communication overheads of up to 90% of DNN training time, and network-connected instances can suffer up to a 5× slowdown compared to training on a single instance. Furthermore, (iii) we model the impact of macroscopic DNN features, such as the number of layers and the number of gradients, on communication stalls, and finally, (iv) we briefly discuss a cost comparison with existing work.
UR - http://www.scopus.com/inward/record.url?scp=85175096168&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85175096168&partnerID=8YFLogxK
U2 - 10.1109/ICDCS57875.2023.00023
DO - 10.1109/ICDCS57875.2023.00023
M3 - Conference contribution
AN - SCOPUS:85175096168
T3 - Proceedings - International Conference on Distributed Computing Systems
SP - 875
EP - 886
BT - Proceedings - 2023 IEEE 43rd International Conference on Distributed Computing Systems, ICDCS 2023
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 18 July 2023 through 21 July 2023
ER -