TY - GEN
T1 - Metastable failures in distributed systems
AU - Bronson, Nathan
AU - Aghayev, Abutalib
AU - Charapko, Aleksey
AU - Zhu, Timothy
N1 - Publisher Copyright: © 2021 ACM.
PY - 2021/6/1
Y1 - 2021/6/1
N2 - We describe metastable failures, a failure pattern in distributed systems. Currently, metastable failures manifest themselves as black swan events; they are outliers because nothing in the past points to their possibility, have a severe impact, and are much easier to explain in hindsight than to predict. Although instances of metastable failures can look different on the surface, deeper analysis shows that they can be understood within the same framework. We introduce a framework for thinking about metastable failures, apply it to examples observed during years of operating distributed systems at scale, and survey ad-hoc techniques developed post-factum for making systems resilient to known metastable failures. A systematic approach for building systems that are robust against unknown metastable failures remains an open problem.
AB - We describe metastable failures, a failure pattern in distributed systems. Currently, metastable failures manifest themselves as black swan events; they are outliers because nothing in the past points to their possibility, have a severe impact, and are much easier to explain in hindsight than to predict. Although instances of metastable failures can look different on the surface, deeper analysis shows that they can be understood within the same framework. We introduce a framework for thinking about metastable failures, apply it to examples observed during years of operating distributed systems at scale, and survey ad-hoc techniques developed post-factum for making systems resilient to known metastable failures. A systematic approach for building systems that are robust against unknown metastable failures remains an open problem.
UR - http://www.scopus.com/inward/record.url?scp=85107847964&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85107847964&partnerID=8YFLogxK
U2 - 10.1145/3458336.3465286
DO - 10.1145/3458336.3465286
M3 - Conference contribution
AN - SCOPUS:85107847964
T3 - HotOS 2021 - Proceedings of the 2021 Workshop on Hot Topics in Operating Systems
SP - 221
EP - 227
BT - HotOS 2021 - Proceedings of the 2021 Workshop on Hot Topics in Operating Systems
PB - Association for Computing Machinery, Inc
T2 - 18th Workshop on Hot Topics in Operating Systems, HotOS 2021
Y2 - 1 June 2021 through 3 June 2021
ER -