3D coded SUMMA: Communication-efficient and robust parallel matrix multiplication

Haewon Jeong, Yaoqing Yang, Vipul Gupta, Christian Engelmann, Tze Meng Low, Viveck Cadambe, Kannan Ramchandran, Pulkit Grover

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure-tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault-tolerance in high-performance computing (HPC). The core idea of coded computing is the same as that of algorithm-based fault-tolerance (ABFT): weaving redundancy into the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplications, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires ~50% less redundancy than replication, while the overhead in execution time is only about 5–10%.
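
The MatDot idea mentioned in the abstract can be illustrated with a small pure-Python sketch. It follows the MatDot construction as it is commonly described in the coded-computing literature (split A into column blocks and B into row blocks, encode both as matrix polynomials, and read A·B off one coefficient of their product); it is not the 3D Coded SUMMA implementation from the paper. All function names, the evaluation points, and the five-worker setup are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations

def matmul(X, Y):
    """Plain dense product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lin_comb(blocks, weights):
    """Entrywise sum of weights[i] * blocks[i]."""
    rows, cols = len(blocks[0]), len(blocks[0][0])
    return [[sum(w * blk[r][c] for blk, w in zip(blocks, weights))
             for c in range(cols)] for r in range(rows)]

def matdot_encode(A, B, m, x):
    """Coded inputs for one (hypothetical) worker with evaluation point x."""
    w = len(B) // m  # block width; assumes m divides the inner dimension
    A_blocks = [[row[i * w:(i + 1) * w] for row in A] for i in range(m)]  # column blocks
    B_blocks = [B[i * w:(i + 1) * w] for i in range(m)]                  # row blocks
    pA = lin_comb(A_blocks, [x ** i for i in range(m)])            # sum_i A_i x^i
    pB = lin_comb(B_blocks, [x ** (m - 1 - j) for j in range(m)])  # sum_j B_j x^(m-1-j)
    return pA, pB

def lagrange_coeff(points, k, degree):
    """Coefficient of x**degree in the k-th Lagrange basis polynomial."""
    coeffs = [Fraction(1)]
    for j, xj in enumerate(points):
        if j == k:
            continue
        scale = Fraction(1, points[k] - xj)
        nxt = [Fraction(0)] * (len(coeffs) + 1)
        for d, c in enumerate(coeffs):  # multiply by (x - xj) * scale
            nxt[d + 1] += c * scale
            nxt[d] -= c * xj * scale
        coeffs = nxt
    return coeffs[degree]

def matdot_decode(results, m):
    """Recover A*B from any 2m - 1 pairs (x, pA(x) * pB(x))."""
    results = results[:2 * m - 1]
    points = [x for x, _ in results]
    # A*B is the coefficient of x^(m-1) in the degree-(2m-2) product polynomial.
    weights = [lagrange_coeff(points, k, m - 1) for k in range(len(points))]
    return lin_comb([C for _, C in results], weights)

# Toy setup (assumed): m = 2 blocks, 5 workers, so any 2 workers may fail.
A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 1, 0, 1], [1, 2, 1, 0]]
m, xs = 2, [1, 2, 3, 4, 5]
outputs = []
for x in xs:
    pA, pB = matdot_encode(A, B, m, x)
    outputs.append((x, matmul(pA, pB)))

# Drop every possible pair of workers and decode from the three survivors.
recovered_ok = all(
    matdot_decode([o for i, o in enumerate(outputs) if i not in failed], m) == matmul(A, B)
    for failed in combinations(range(len(xs)), 2)
)
```

In this toy setting each worker multiplies a 4×2 coded block by a 2×4 coded block, and the results of any three of the five workers are enough to reconstruct the full product. Exact rational arithmetic (`Fraction`) is used only to keep the check exact; a practical implementation would use floating point and would need to consider the numerical conditioning of the interpolation.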

Original language: English (US)
Title of host publication: Euro-Par 2020
Subtitle of host publication: Parallel Processing - 26th International Conference on Parallel and Distributed Computing, Proceedings
Editors: Maciej Malawski, Krzysztof Rzadca
Publisher: Springer
Pages: 392-407
Number of pages: 16
ISBN (Print): 9783030576745
State: Published - 2020
Event: 26th International European Conference on Parallel and Distributed Computing, Euro-Par 2020 - Warsaw, Poland
Duration: Aug 24 2020 → Aug 28 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12247 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 26th International European Conference on Parallel and Distributed Computing, Euro-Par 2020
Country/Territory: Poland
City: Warsaw
Period: 8/24/20 → 8/28/20

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)
