TY - GEN
T1 - Thorough Characterization and Analysis of Large Transformer Model Training At-Scale
AU - Cheng, Scott
AU - Lin, Jun Liang
AU - Emani, Murali
AU - Raskar, Siddhisanket
AU - Foreman, Sam
AU - Xie, Zhen
AU - Vishwanath, Venkatram
AU - Kandemir, Mahmut T.
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s).
PY - 2024/6/10
Y1 - 2024/6/10
N2 - Large transformer models have recently achieved great success across various domains. With a growing number of model parameters, training a large transformer model today typically involves model sharding, data parallelism, and model parallelism. Thus, the throughput of large-scale model training depends heavily on the network bandwidth, since the combination of model sharding and multiple parallelism strategies incurs various communication costs. However, prior characterizations of transformer models on high-bandwidth DGX machines that use TFLOPS as a metric may not reflect the performance of a system with lower bandwidth. Furthermore, data and model parallelism reveal significantly distinct training profiles at scale on systems with different bandwidths and thus warrant a thorough study. In this paper, we provide a bottom-up breakdown of training throughput into compute and communication time, and quantitatively analyze their respective influences on overall end-to-end training scaling. Our evaluation involves an in-depth exploration of data parallelism, scaling up to 512 GPUs with limited bandwidth, and examines three model sharding strategies across six model sizes. We also evaluate three combinations of model parallelism on both high- and low-bandwidth supercomputing systems. Overall, our work provides a broader perspective on large-scale transformer model training, and our analysis and evaluation yield practical insights for predicting training scaling, shaping the future development of supercomputing system design.
AB - Large transformer models have recently achieved great success across various domains. With a growing number of model parameters, training a large transformer model today typically involves model sharding, data parallelism, and model parallelism. Thus, the throughput of large-scale model training depends heavily on the network bandwidth, since the combination of model sharding and multiple parallelism strategies incurs various communication costs. However, prior characterizations of transformer models on high-bandwidth DGX machines that use TFLOPS as a metric may not reflect the performance of a system with lower bandwidth. Furthermore, data and model parallelism reveal significantly distinct training profiles at scale on systems with different bandwidths and thus warrant a thorough study. In this paper, we provide a bottom-up breakdown of training throughput into compute and communication time, and quantitatively analyze their respective influences on overall end-to-end training scaling. Our evaluation involves an in-depth exploration of data parallelism, scaling up to 512 GPUs with limited bandwidth, and examines three model sharding strategies across six model sizes. We also evaluate three combinations of model parallelism on both high- and low-bandwidth supercomputing systems. Overall, our work provides a broader perspective on large-scale transformer model training, and our analysis and evaluation yield practical insights for predicting training scaling, shaping the future development of supercomputing system design.
UR - http://www.scopus.com/inward/record.url?scp=85196385148&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85196385148&partnerID=8YFLogxK
U2 - 10.1145/3652963.3655087
DO - 10.1145/3652963.3655087
M3 - Conference contribution
AN - SCOPUS:85196385148
T3 - SIGMETRICS/PERFORMANCE 2024 - Abstracts of the 2024 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems
SP - 39
EP - 40
BT - SIGMETRICS/PERFORMANCE 2024 - Abstracts of the 2024 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems
PB - Association for Computing Machinery, Inc
T2 - 2024 ACM SIGMETRICS/IFIP Performance Conference on Measurement and Modeling of Computer Systems, SIGMETRICS/PERFORMANCE 2024
Y2 - 10 June 2024 through 14 June 2024
ER -