TY - JOUR
T1 - Addressing Unreliability in Emerging Devices and Non-von Neumann Architectures Using Coded Computing
AU - Dutta, Sanghamitra
AU - Jeong, Haewon
AU - Yang, Yaoqing
AU - Cadambe, Viveck
AU - Low, Tze Meng
AU - Grover, Pulkit
N1 - Funding Information:
Manuscript received February 1, 2019; revised March 19, 2020; accepted March 20, 2020. Date of publication May 14, 2020; date of current version July 17, 2020. This work was supported by NSF under Grant 1763561. (Sanghamitra Dutta, Haewon Jeong, and Yaoqing Yang contributed equally to this work.) (Corresponding author: Pulkit Grover.) Sanghamitra Dutta, Haewon Jeong, Tze Meng Low, and Pulkit Grover are with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213 USA (e-mail: pulkit@cmu.edu). Yaoqing Yang is with the Department of Electrical Engineering and Computer Science, University of California at Berkeley (UC Berkeley), Berkeley, CA 94720 USA. Viveck Cadambe is with the Department of Electrical Engineering, Penn State University, State College, PA 16801 USA.
Publisher Copyright:
© 1963-2012 IEEE.
PY - 2020/8
Y1 - 2020/8
N2 - Computing systems are evolving rapidly. At the device level, emerging devices are beginning to compete with traditional CMOS systems. At the architecture level, novel architectures are successfully avoiding the communication bottleneck that is a central feature, and a central limitation, of the von Neumann architecture. Furthermore, such systems are increasingly plagued by unreliability. This unreliability arises at device or gate-level in emerging devices, and can percolate up to processor or system-level if left unchecked. The goal of this article is to survey recent advances in reliable computing using unreliable elements, with an eye on nonsilicon and non-von Neumann architectures. We first observe that instead of aiming for generic computing problems, the community could use 'dwarfs of modern computing,' first noted in the high-performance computing (HPC) community, as a starting point. These computing problems are the basic building blocks of almost all scientific computing, machine learning, and data analytics today. Next, we survey the state of the art in 'coded computing,' which is an emerging area that advances on classical algorithm-based fault-tolerance (ABFT) and brings a fundamental information-theoretic perspective. By weaving error-correcting codes into a computing algorithm, coded computing provides dramatic improvements on solutions, as well as obtains novel fundamental limits, for problems that have been open for more than 30 years. We introduce existing and novel coded computing techniques in the context of 'coded dwarfs,' where a specific dwarf's computation is made resilient by applying coding. We discuss how, for the same redundancy, 'coded dwarfs' are significantly more resilient compared to classical techniques such as replication. Furthermore, by examining a widely popular computation task - training large neural networks - we demonstrate how coded dwarfs can be applied to address this fundamentally nonlinear problem. Finally, we discuss practical challenges and future directions in implementing coded computing techniques on emerging and existing nonsilicon and/or non-von Neumann architectures.
UR - http://www.scopus.com/inward/record.url?scp=85085113634&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85085113634&partnerID=8YFLogxK
U2 - 10.1109/JPROC.2020.2986362
DO - 10.1109/JPROC.2020.2986362
M3 - Article
AN - SCOPUS:85085113634
SN - 0018-9219
VL - 108
SP - 1219
EP - 1234
JO - Proceedings of the IEEE
JF - Proceedings of the IEEE
IS - 8
M1 - 9093912
ER -