Thorough Characterization and Analysis of Large Transformer Model Training At-Scale

Scott Cheng, Jun Liang Lin, Murali Emani, Siddhisanket Raskar, Sam Foreman, Zhen Xie, Venkatram Vishwanath, Mahmut T. Kandemir

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Large transformer models have recently achieved great success across various domains. With a growing number of model parameters, a large transformer model training today typically involves model sharding, data parallelism, and model parallelism. Thus, the throughput of large-scale model training depends heavily on the network bandwidth since a combination of model sharding and multiple parallelism strategies incurs various costs. However, prior characterizations of transformer models on high-bandwidth DGX machines that use TFLOPS as a metric may not reflect the performance of a system with lower bandwidth. Furthermore, data and model parallelism reveal significantly distinct training profiles on different system bandwidths at scale and, thus, need a thorough study. In this paper, we provide a bottom-up breakdown of training throughput into compute and communication time, and quantitatively analyze their respective influences on overall end-To-end training scaling. Our evaluation involves an in-depth exploration of data parallelism, scaling up to 512 GPUs with limited bandwidth, and examines three model sharding strategies among six model sizes. We also evaluate three combinations of model parallelism on both high and low bandwidth supercomputing systems. Overall, our work provides a broader perspective on large-scale transformer model training, and our analysis and evaluation yield practical insights for predicting training scaling, shaping the future development of supercomputing system design.

Original languageEnglish (US)
Title of host publicationSIGMETRICS/PERFORMANCE 2024 - Abstracts of the 2024 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems
PublisherAssociation for Computing Machinery, Inc
Pages39-40
Number of pages2
ISBN (Electronic)9798400706240
DOIs
StatePublished - Jun 10 2024
Event2024 ACM SIGMETRICS/IFIP Performance Conference on Measurement and Modeling of Computer Systems, SIGMETRICS/PERFORMANCE 2024 - Venice, Italy
Duration: Jun 10 2024Jun 14 2024

Publication series

NameSIGMETRICS/PERFORMANCE 2024 - Abstracts of the 2024 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems

Conference

Conference2024 ACM SIGMETRICS/IFIP Performance Conference on Measurement and Modeling of Computer Systems, SIGMETRICS/PERFORMANCE 2024
Country/TerritoryItaly
CityVenice
Period6/10/246/14/24

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Computer Networks and Communications
  • Hardware and Architecture

Cite this