Metastable Failures in the Wild

Lexiang Huang, Matthew Magnusson, Abishek Bangalore Muralikrishna, Salman Estyak, Rebecca Isaacs, Abutalib Aghayev, Timothy Zhu, Aleksey Charapko

Research output: Chapter in Book/Report/Conference proceedingConference contribution

19 Scopus citations

Abstract

Recently, Bronson et al. [7] introduced a framework for understanding a class of failures in distributed systems called metastable failures. The examples of metastable failures presented in that work are simplified versions of failures observed at Facebook. In this work, we study the prevalence of such failures in the wild by scouring over publicly available incident reports from many organizations, ranging from hyperscalers to small companies. Our main findings are threefold. First, metastable failures are universally observed-we present an in-depth study of 22 metastable failures from 11 different organizations. Second, metastable failures are a recurring pattern in many severe outages-e.g., at least 4 out of 15 major outages in the last decade at Amazon Web Services were caused by metastable failures. Third, we extend the model by Bronson et al. to better reflect the metastable failures seen in the wild by categorizing two types of triggers and two types of amplification mechanisms, which we confirm through developing multiple example applications that reproduce different types of metastable failures in a controlled environment. We believe our work will aid in a deeper understanding of metastable failures and in coming up with solutions to them.

Original languageEnglish (US)
Title of host publicationProceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
PublisherUSENIX Association
Pages73-90
Number of pages18
ISBN (Electronic)9781939133281
StatePublished - 2022
Event16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022 - Carlsbad, United States
Duration: Jul 11 2022Jul 13 2022

Publication series

NameProceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022

Conference

Conference16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022
Country/TerritoryUnited States
CityCarlsbad
Period7/11/227/13/22

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems

Fingerprint

Dive into the research topics of 'Metastable Failures in the Wild'. Together they form a unique fingerprint.

Cite this