Filtering failure logs for a BlueGene/L prototype

Yinglung Liang, Yanyong Zhang, Anand Sivasubramaniam, Ramendra K. Sahoo, Jose Moreira, Manish Gupta

Research output: Contribution to conferencePaperpeer-review

113 Scopus citations

Abstract

The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can accommodate as many as 128K processors. In this paper, we present our experiences in collecting and filtering error event logs from a 8192 processor BlueGene/L prototype at IBM Rochester, which is currently ranked #8 in the Top-500 list. We analyze the logs collected from this machine over a period of 84 days starting from August 26, 2004. We perform a three-step filtering algorithm on these logs: extracting and categorizing failure events; temporal filtering to remove duplicate reports from the same location; and finally coalescing failure reports of the same error across different locations. Using this approach, we can substantially compress these logs, removing over 99.96% of the 828,387 original entries, and more accurately portray the failure occurrences on this system.

Original languageEnglish (US)
Pages476-485
Number of pages10
StatePublished - 2005
Event2005 International Conference on Dependable Systems and Networks - Yokohama, Japan
Duration: Jun 28 2005Jul 1 2005

Other

Other2005 International Conference on Dependable Systems and Networks
Country/TerritoryJapan
CityYokohama
Period6/28/057/1/05

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Computer Networks and Communications

Cite this