TY - GEN
T1 - Fault-aware job scheduling for BlueGene/L systems
AU - Oliner, A. J.
AU - Sahoo, R. K.
AU - Moreira, J. E.
AU - Gupta, M.
AU - Sivasubramaniam, Anand
PY - 2004
Y1 - 2004
N2 - Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. In this paper evaluate the effectiveness of a previously developed job scheduling algorithm for BlueGene/L in the presence of faults. We have developed two new job-scheduling algorithms considering failures while scheduling the jobs. We have also evaluated the impact of these algorithms on average bounded slowdown, overage response time and system utilization, considering different levels of proactive failure prediction and prevention techniques reported in the literature. Our simulation studies show that the use of these new algorithms with even trivial fault prediction confidence or accuracy levels (as low as 10%) can significantly improve the performance of the BlueGene/L system.
AB - Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. In this paper evaluate the effectiveness of a previously developed job scheduling algorithm for BlueGene/L in the presence of faults. We have developed two new job-scheduling algorithms considering failures while scheduling the jobs. We have also evaluated the impact of these algorithms on average bounded slowdown, overage response time and system utilization, considering different levels of proactive failure prediction and prevention techniques reported in the literature. Our simulation studies show that the use of these new algorithms with even trivial fault prediction confidence or accuracy levels (as low as 10%) can significantly improve the performance of the BlueGene/L system.
UR - http://www.scopus.com/inward/record.url?scp=12444257746&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=12444257746&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:12444257746
SN - 0769521320
SN - 9780769521329
T3 - Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2004 (Abstracts and CD-ROM)
SP - 899
EP - 908
BT - Proceedings - 18th International Parallel and Distributed Processing Symposium, IPDPS 2004 (Abstracts and CD-ROM)
T2 - Proceedings - 18th International Parallel and Distributed Processing Symposium, IPDPS 2004 (Abstracts and CD-ROM)
Y2 - 26 April 2004 through 30 April 2004
ER -