TY - GEN
T1 - Metastable Failures in the Wild
AU - Huang, Lexiang
AU - Magnusson, Matthew
AU - Muralikrishna, Abishek Bangalore
AU - Estyak, Salman
AU - Isaacs, Rebecca
AU - Aghayev, Abutalib
AU - Zhu, Timothy
AU - Charapko, Aleksey
N1 - Funding Information:
What sets a metastable failure apart from all of the above is that its root cause is not a specific hardware failure or a software bug. It is an emergent behavior of a complex system that naturally arises from optimizations for the common case. Specifically, if the aforementioned failures do not trigger a metastable failure, then identifying and eliminating them restores the system functionality. If, however, they do trigger a metastable failure, then eliminating them will not restore the system’s functionality.
8 Conclusion
Metastable failures are a class of system failures characterized by sustaining effects that keep systems in a degraded state and resist recovery. While relatively infrequent, metastable failures were behind big outages at large internet companies (including a recent AWS outage on December 7th, 2021). In this work, we confirm this observation by studying public incident reports. We then extend the metastability framework based on our observations for a more accurate metastability model. We validate our model by building three applications and reproducing different instances of metastability on them. We hope our work spurs further research into understanding and preventing metastable failures.
Acknowledgments
We thank our shepherd Atul Adya and the anonymous reviewers who provided constructive and helpful feedback. We also thank Nathan Bronson for his insightful comments and suggestions. This research was supported in part by AWS Cloud Credit for Research.
Publisher Copyright:
© 2022 by The USENIX Association. All rights reserved.
PY - 2022
Y1 - 2022
N2 - Recently, Bronson et al. [7] introduced a framework for understanding a class of failures in distributed systems called metastable failures. The examples of metastable failures presented in that work are simplified versions of failures observed at Facebook. In this work, we study the prevalence of such failures in the wild by scouring publicly available incident reports from many organizations, ranging from hyperscalers to small companies. Our main findings are threefold. First, metastable failures are universally observed: we present an in-depth study of 22 metastable failures from 11 different organizations. Second, metastable failures are a recurring pattern in many severe outages; e.g., at least 4 out of 15 major outages in the last decade at Amazon Web Services were caused by metastable failures. Third, we extend the model by Bronson et al. to better reflect the metastable failures seen in the wild by categorizing two types of triggers and two types of amplification mechanisms, which we confirm through developing multiple example applications that reproduce different types of metastable failures in a controlled environment. We believe our work will aid in a deeper understanding of metastable failures and in coming up with solutions to them.
AB - Recently, Bronson et al. [7] introduced a framework for understanding a class of failures in distributed systems called metastable failures. The examples of metastable failures presented in that work are simplified versions of failures observed at Facebook. In this work, we study the prevalence of such failures in the wild by scouring publicly available incident reports from many organizations, ranging from hyperscalers to small companies. Our main findings are threefold. First, metastable failures are universally observed: we present an in-depth study of 22 metastable failures from 11 different organizations. Second, metastable failures are a recurring pattern in many severe outages; e.g., at least 4 out of 15 major outages in the last decade at Amazon Web Services were caused by metastable failures. Third, we extend the model by Bronson et al. to better reflect the metastable failures seen in the wild by categorizing two types of triggers and two types of amplification mechanisms, which we confirm through developing multiple example applications that reproduce different types of metastable failures in a controlled environment. We believe our work will aid in a deeper understanding of metastable failures and in coming up with solutions to them.
UR - http://www.scopus.com/inward/record.url?scp=85141043271&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85141043271&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85141043271
T3 - Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
SP - 73
EP - 90
BT - Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
PB - USENIX Association
T2 - 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
Y2 - 11 July 2022 through 13 July 2022
ER -