TY - JOUR
T1 - Thorough Characterization and Analysis of Large Transformer Model Training At-Scale
AU - Cheng, Scott
AU - Lin, Jun Liang
AU - Emani, Murali
AU - Raskar, Siddhisanket
AU - Foreman, Sam
AU - Xie, Zhen
AU - Vishwanath, Venkatram
AU - Kandemir, Mahmut T.
N1 - Publisher Copyright:
© 2024 Owner/Author.
PY - 2024/6/10
Y1 - 2024/6/10
N2 - Large transformer models have recently achieved great success across various domains. With a growing number of model parameters, training a large transformer model today typically involves model sharding, data parallelism, and model parallelism. Thus, the throughput of large-scale model training depends heavily on the network bandwidth, since the combination of model sharding and multiple parallelism strategies incurs various communication costs. However, prior characterizations of transformer models on high-bandwidth DGX machines that use TFLOPS as a metric may not reflect the performance of a system with lower bandwidth. Furthermore, data and model parallelism exhibit significantly distinct training profiles at scale under different system bandwidths and thus warrant a thorough study. In this paper, we provide a bottom-up breakdown of training throughput into compute and communication time, and quantitatively analyze their respective influences on overall end-to-end training scaling. Our evaluation involves an in-depth exploration of data parallelism, scaling up to 512 GPUs with limited bandwidth, and examines three model sharding strategies across six model sizes. We also evaluate three combinations of model parallelism on both high- and low-bandwidth supercomputing systems. Overall, our work provides a broader perspective on large-scale transformer model training, and our analysis and evaluation yield practical insights for predicting training scaling and informing the future design of supercomputing systems.
AB - Large transformer models have recently achieved great success across various domains. With a growing number of model parameters, training a large transformer model today typically involves model sharding, data parallelism, and model parallelism. Thus, the throughput of large-scale model training depends heavily on the network bandwidth, since the combination of model sharding and multiple parallelism strategies incurs various communication costs. However, prior characterizations of transformer models on high-bandwidth DGX machines that use TFLOPS as a metric may not reflect the performance of a system with lower bandwidth. Furthermore, data and model parallelism exhibit significantly distinct training profiles at scale under different system bandwidths and thus warrant a thorough study. In this paper, we provide a bottom-up breakdown of training throughput into compute and communication time, and quantitatively analyze their respective influences on overall end-to-end training scaling. Our evaluation involves an in-depth exploration of data parallelism, scaling up to 512 GPUs with limited bandwidth, and examines three model sharding strategies across six model sizes. We also evaluate three combinations of model parallelism on both high- and low-bandwidth supercomputing systems. Overall, our work provides a broader perspective on large-scale transformer model training, and our analysis and evaluation yield practical insights for predicting training scaling and informing the future design of supercomputing systems.
UR - http://www.scopus.com/inward/record.url?scp=85196397902&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85196397902&partnerID=8YFLogxK
U2 - 10.1145/3673660.3655087
DO - 10.1145/3673660.3655087
M3 - Article
AN - SCOPUS:85196397902
SN - 0163-5999
VL - 52
SP - 39
EP - 40
JO - Performance Evaluation Review
JF - Performance Evaluation Review
IS - 1
ER -