Project Details
Description
Large-scale parallel systems are essential for meeting the challenges
posed by highly demanding applications of critical importance. Pushing the
limits of hardware and software technologies to extract maximum
performance can increase these systems' susceptibility to failures, as a
consequence of increasing transient hardware errors, hardware device failures,
and software complexity. These failures can have substantial consequences
on system performance and add to maintenance and operation costs,
thereby putting at risk the very motivation for deploying these
large-scale systems. Rather than treating failures as exceptions and applying
reactive remedies, this project intends to anticipate their occurrence
and take proactive runtime measures to hide their impact.
This research is expected to make three broad contributions towards
developing a runtime fault-tolerance infrastructure.
The first set of contributions concerns collecting and analyzing
system events from an actual BlueGene/L system over an
extended period of time. The second set consists of models for
online analysis and prediction of evolving failure data.
The third set concerns failure-aware parallel job
scheduling and checkpointing. On the educational front, in addition to
enhancing graduate curriculum and research, this project intends to involve
undergraduate students and women. The tools developed in this project and the
related results will be made available in the public domain and published in
leading journals and conferences. In addition, the PIs will work to have these
tools incorporated into actual systems to enhance their fault-tolerance
capabilities.
Status | Finished |
---|---|
Effective start/end date | 8/15/06 → 7/31/11 |
Funding
- National Science Foundation: $356,860.00