TY - GEN
T1 - 3D coded SUMMA
T2 - 26th International European Conference on Parallel and Distributed Computing, Euro-Par 2020
AU - Jeong, Haewon
AU - Yang, Yaoqing
AU - Gupta, Vipul
AU - Engelmann, Christian
AU - Low, Tze Meng
AU - Cadambe, Viveck
AU - Ramchandran, Kannan
AU - Grover, Pulkit
N1 - Funding Information:
This work was sponsored by the U.S. Department of Energy’s Office of Advanced Scientific Computing Research. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/ downloads/doe-public-access-plan).. Acknowledgements and Data Availability Statement. This project was supported partially by NSF grant CCF 1763657 and supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program managers Robinson Pino and Lucy Nowell. This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The datasets and code generated during and/or analysed during the current study are available in the Figshare repository: https://doi.org/10.6084/m9.figshare.12560 330 [19].
Publisher Copyright:
© Springer Nature Switzerland AG 2020.
PY - 2020
Y1 - 2020
N2 - In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure-tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault-tolerance in high-performance computing (HPC). The core idea of coded computing is the same as algorithm-based fault-tolerance (ABFT), which is weaving redundancy in the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplications, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires ~50% less redundancy than replication, while the overhead in execution time is only about 5–10%.
AB - In this paper, we propose a novel fault-tolerant parallel matrix multiplication algorithm called 3D Coded SUMMA that achieves higher failure-tolerance than replication-based schemes for the same amount of redundancy. This work bridges the gap between recent developments in coded computing and fault-tolerance in high-performance computing (HPC). The core idea of coded computing is the same as algorithm-based fault-tolerance (ABFT), which is weaving redundancy in the computation using error-correcting codes. In particular, we show that MatDot codes, an innovative code construction for parallel matrix multiplications, can be integrated into three-dimensional SUMMA (Scalable Universal Matrix Multiplication Algorithm [30]) in a communication-avoiding manner. To tolerate any two node failures, the proposed 3D Coded SUMMA requires ~50% less redundancy than replication, while the overhead in execution time is only about 5–10%.
UR - http://www.scopus.com/inward/record.url?scp=85090094091&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090094091&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-57675-2_25
DO - 10.1007/978-3-030-57675-2_25
M3 - Conference contribution
AN - SCOPUS:85090094091
SN - 9783030576745
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 392
EP - 407
BT - Euro-Par 2020
A2 - Malawski, Maciej
A2 - Rzadca, Krzysztof
PB - Springer
Y2 - 24 August 2020 through 28 August 2020
ER -