Failure data analysis of a large-scale heterogeneous server environment

Ramendra K. Sahoo, Mark S. Squillante, Anand Sivasubramaniam, Yanyong Zhang

Research output: Contribution to conference › Paper › peer-review

192 Scopus citations


The growing complexity of hardware and software mandates the recognition of fault occurrence in system deployment and management. While there are several techniques to prevent and/or handle faults, there continues to be a growing need for an in-depth understanding of system errors and failures and their empirical and statistical properties. This understanding can help evaluate the effectiveness of different techniques for improving system availability, in addition to developing new solutions. In this paper, we analyze the empirical and statistical properties of system errors and failures from a network of nearly 400 heterogeneous servers running a diverse workload over a year. While improvements in system robustness continue to limit the number of actual failures to a very small fraction of the recorded errors, the failure rates are significant and highly variable. Our results also show that the system error and failure patterns are comprised of time-varying behavior containing long stationary intervals. These stationary intervals exhibit various strong correlation structures and periodic patterns, which impact performance but also can be exploited to address such performance issues.

Original language: English (US)
Number of pages: 10
State: Published - 2004
Event: 2004 International Conference on Dependable Systems and Networks - Florence, Italy
Duration: Jun 28 2004 to Jul 1 2004



All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Computer Networks and Communications

