TY - GEN
T1 - BlueGene/L failure analysis and prediction models
AU - Liang, Yinglung
AU - Zhang, Yanyong
AU - Jette, Morris
AU - Sivasubramaniam, Anand
AU - Sahoo, Ramendra
N1 - Copyright:
Copyright 2011 Elsevier B.V., All rights reserved.
PY - 2006
Y1 - 2006
N2 - The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can accommodate as many as 128K processors. One of the challenges when designing and deploying these systems in a production setting is the need to take failure occurrences, whether it be in the hardware or in the software, into account. Earlier work has shown that conventional runtime fault-tolerant techniques such as periodic checkpointing are not effective to the emerging systems. Instead, the ability to predict failure occurrences can help develop more effective checkpointing strategies. Failure prediction has long been regarded as a challenging research problem, mainly due to the lack of realistic failure data from actual production systems. In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days. We have investigated the characteristics of fatal failure events, as well as the correlation between fatal events and non-fatal events. Based on the observations, we have developed three simple yet effective failure prediction methods, which can predict around 80% of the memory and network failures, and 47% of the application I/O failures.
AB - The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can accommodate as many as 128K processors. One of the challenges when designing and deploying these systems in a production setting is the need to take failure occurrences, whether it be in the hardware or in the software, into account. Earlier work has shown that conventional runtime fault-tolerant techniques such as periodic checkpointing are not effective to the emerging systems. Instead, the ability to predict failure occurrences can help develop more effective checkpointing strategies. Failure prediction has long been regarded as a challenging research problem, mainly due to the lack of realistic failure data from actual production systems. In this study, we have collected RAS event logs from BlueGene/L over a period of more than 100 days. We have investigated the characteristics of fatal failure events, as well as the correlation between fatal events and non-fatal events. Based on the observations, we have developed three simple yet effective failure prediction methods, which can predict around 80% of the memory and network failures, and 47% of the application I/O failures.
UR - http://www.scopus.com/inward/record.url?scp=33845589803&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33845589803&partnerID=8YFLogxK
U2 - 10.1109/DSN.2006.18
DO - 10.1109/DSN.2006.18
M3 - Conference contribution
AN - SCOPUS:33845589803
SN - 0769526071
SN - 9780769526072
T3 - Proceedings of the International Conference on Dependable Systems and Networks
SP - 425
EP - 434
BT - Proceedings - DSN 2006
T2 - DSN 2006: 2006 International Conference on Dependable Systems and Networks
Y2 - 25 June 2006 through 28 June 2006
ER -