Thorough characterization and analysis of large transformer model training at-scale

Scott Cheng, Jun Liang Lin, Murali Emani, Siddhisanket Raskar, Sam Foreman, Zhen Xie, Venkatram Vishwanath, Mahmut T. Kandemir

Research output: Contribution to journal › Article › peer-review


Abstract

Large transformer models have recently achieved great success across various domains. With a growing number of model parameters, training a large transformer model today typically involves model sharding, data parallelism, and model parallelism. Thus, the throughput of large-scale model training depends heavily on the network bandwidth, since the combination of model sharding and multiple parallelism strategies incurs various communication costs. However, prior characterizations of transformer models on high-bandwidth DGX machines that use TFLOPS as a metric may not reflect the performance of a system with lower bandwidth. Furthermore, data and model parallelism reveal significantly distinct training profiles on different system bandwidths at scale and, thus, need a thorough study. In this paper, we provide a bottom-up breakdown of training throughput into compute and communication time, and quantitatively analyze their respective influences on overall end-to-end training scaling. Our evaluation involves an in-depth exploration of data parallelism, scaling up to 512 GPUs with limited bandwidth, and examines three model sharding strategies among six model sizes. We also evaluate three combinations of model parallelism on both high- and low-bandwidth supercomputing systems. Overall, our work provides a broader perspective on large-scale transformer model training, and our analysis and evaluation yield practical insights for predicting training scaling, shaping the future development of supercomputing system design.
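
Below is a minimal, hypothetical sketch of the kind of bottom-up breakdown the abstract describes: per-step training time modeled as compute time plus gradient all-reduce time under data parallelism, with interconnect bandwidth as an explicit parameter. The function name, parameter values, and the ring all-reduce cost model are illustrative assumptions for this sketch, not the paper's actual methodology or measurements.

```python
# Illustrative sketch (not the authors' model): a simple bottom-up estimate of
# per-step time for data-parallel training, showing why throughput depends on
# network bandwidth. All numeric values below are hypothetical.

def step_time_data_parallel(
    params: float,            # number of model parameters
    flops_per_step: float,    # forward + backward FLOPs per GPU per step
    gpu_tflops: float,        # achievable GPU throughput in TFLOP/s
    bandwidth_gb_s: float,    # per-GPU interconnect bandwidth in GB/s
    num_gpus: int,
    bytes_per_grad: int = 2,  # e.g., fp16/bf16 gradients
) -> float:
    """Estimate seconds per step as compute time plus ring all-reduce time."""
    compute = flops_per_step / (gpu_tflops * 1e12)
    # A ring all-reduce moves roughly 2 * (N - 1) / N of the gradient bytes per GPU.
    grad_bytes = params * bytes_per_grad
    comm = 2 * (num_gpus - 1) / num_gpus * grad_bytes / (bandwidth_gb_s * 1e9)
    # Assume no compute/communication overlap, i.e., a simple upper bound.
    return compute + comm

# Hypothetical example: a 13B-parameter model on 512 GPUs with limited bandwidth.
t = step_time_data_parallel(
    params=13e9, flops_per_step=1.2e15, gpu_tflops=150,
    bandwidth_gb_s=25, num_gpus=512,
)
print(f"Estimated step time: {t:.2f} s")
```

Under such a model, shrinking the bandwidth term inflates the communication share of each step, which is why the same model can exhibit very different scaling profiles on high- and low-bandwidth systems.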

Original language: English (US)
Article number: 8
Journal: Proceedings of the ACM on Measurement and Analysis of Computing Systems
Volume: 8
Issue number: 1
DOIs
State: Published - Feb 21 2024

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Safety, Risk, Reliability and Quality
  • Hardware and Architecture
  • Computer Networks and Communications

