Performance implications of failures in large-scale cluster scheduling

Yanyong Zhang, Mark S. Squillante, Anand Sivasubramaniam, Ramendra K. Sahoo

Research output: Contribution to journal › Conference article › peer-review

49 Scopus citations

Abstract

As we continue to evolve into large-scale parallel systems, many of them employing hundreds of computing engines to take on mission-critical roles, it is crucial to design those systems anticipating and accommodating the occurrence of failures. Failures become a commonplace feature of such large-scale systems, and one cannot continue to treat them as an exception. Despite the current and increasing importance of failures in these systems, our understanding of the performance impact of these critical issues on parallel computing environments is extremely limited. In this paper we develop a general failure modeling framework based on recent results from large-scale clusters and then we exploit this framework to conduct a detailed performance analysis of the impact of failures on system performance for a wide range of scheduling policies. Our results demonstrate that such failures can have a significant impact on the mean job response time and mean job slowdown under existing scheduling policies that ignore failures. We therefore investigate different scheduling mechanisms and policies to address these performance issues. Our results show that periodic checkpointing of jobs seems to do little to ease this problem. On the other hand, we demonstrate that information about the spatial and temporal correlation of failure occurrences can be very useful in designing a scheduling (job allocation) strategy to enhance system performance, with the former providing the greatest benefits.

Original language: English (US)
Pages (from-to): 233-252
Number of pages: 20
Journal: Lecture Notes in Computer Science
Volume: 3277
State: Published - Jan 1 2005
Event: 10th International Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2004 - New York, NY, United States
Duration: Jun 13 2004 - Jun 13 2004

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science(all)
