TY - GEN
T1 - Predicting GPU Failures With High Precision Under Deep Learning Workloads
AU - Liu, Heting
AU - Li, Zhichao
AU - Tan, Cheng
AU - Yang, Rongqiu
AU - Cao, Guohong
AU - Liu, Zherui
AU - Guo, Chuanxiong
N1 - Publisher Copyright:
© 2023 ACM.
PY - 2023/6/5
Y1 - 2023/6/5
N2 - Graphics processing units (GPUs) are the de facto standard for processing deep learning (DL) tasks. In large-scale GPU clusters, GPU failures are inevitable and may cause severe consequences. For example, GPU failures disrupt distributed training, crash inference services, and result in service level agreement violations. In this paper, we study the problem of predicting GPU failures using machine learning (ML) models to mitigate their damage. We train prediction models on a four-month production dataset with 350 million entries at ByteDance. We observe that classic prediction models (GBDT, MLP, LSTM, and 1D-CNN) do not perform well: they are inaccurate for predictions and unstable over time. We propose several techniques to improve the precision and stability of predictions, including parallel and cascade model-ensemble mechanisms and a sliding training method. We evaluate the performance of our proposed techniques. The results show that our proposed techniques improve the prediction precision from 46.3% to 85.4% on production workloads.
AB - Graphics processing units (GPUs) are the de facto standard for processing deep learning (DL) tasks. In large-scale GPU clusters, GPU failures are inevitable and may cause severe consequences. For example, GPU failures disrupt distributed training, crash inference services, and result in service level agreement violations. In this paper, we study the problem of predicting GPU failures using machine learning (ML) models to mitigate their damage. We train prediction models on a four-month production dataset with 350 million entries at ByteDance. We observe that classic prediction models (GBDT, MLP, LSTM, and 1D-CNN) do not perform well: they are inaccurate for predictions and unstable over time. We propose several techniques to improve the precision and stability of predictions, including parallel and cascade model-ensemble mechanisms and a sliding training method. We evaluate the performance of our proposed techniques. The results show that our proposed techniques improve the prediction precision from 46.3% to 85.4% on production workloads.
UR - http://www.scopus.com/inward/record.url?scp=85165982240&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85165982240&partnerID=8YFLogxK
U2 - 10.1145/3579370.3594777
DO - 10.1145/3579370.3594777
M3 - Conference contribution
AN - SCOPUS:85165982240
T3 - Proceedings of the 16th ACM International Conference on Systems and Storage, SYSTOR 2023
SP - 124
EP - 135
BT - Proceedings of the 16th ACM International Conference on Systems and Storage, SYSTOR 2023
PB - Association for Computing Machinery, Inc
T2 - 16th ACM International Conference on Systems and Storage, SYSTOR 2023
Y2 - 5 June 2023 through 7 June 2023
ER -