Collaborative Research: CSR---SMA+AES: Pro-Active Runtime Health Enhancement of Large-Scale Parallel Systems Using PROGNOSIS

Project: Research project

Project Details

Description

Large-scale parallel systems are critical for meeting the challenges posed by highly demanding applications of critical importance. Pushing the limits of hardware and software technologies to extract maximum performance can increase their susceptibility to failures, a consequence of growing hardware transient errors, hardware device failures, and software complexity. These failures can have substantial consequences for system performance and add to the costs of maintenance and operation, thereby putting at risk the very motivation for deploying these large-scale systems. Rather than treating failures as exceptions and applying reactive remedies, this project intends to anticipate their occurrence and take pro-active runtime measures to hide their impact.

This research is expected to make three broad contributions towards developing a runtime fault-tolerance infrastructure. The first set of contributions concerns collecting and analyzing system events from an actual BlueGene/L system over an extended period of time. The second set consists of models for online analysis and prediction of evolving failure data. The third set concerns failure-aware parallel job scheduling and checkpointing. On the educational front, in addition to enhancing graduate curriculum and research, this project intends to involve undergraduate students and women. The tools developed in this project and the related results will be made available in the public domain and published in leading journals and conferences. In addition, the PIs will push for these tools to be incorporated into actual systems to enhance their fault-tolerance abilities.

Status: Finished
Effective start/end date: 8/15/06 to 7/31/11

Funding

  • National Science Foundation: $356,860.00
