Erich P. Stuntebeck, John S. Davis II, et al.
HotMobile 2008
The growing computational and storage needs of several scientific applications mandate the deployment of extreme-scale parallel machines, such as IBM's BlueGene/L which can accommodate as many as 128K processors. In this paper, we present our experiences in collecting and filtering error event logs from a 8192 processor BlueGene/L prototype at IBM Rochester, which is currently ranked #8 in the Top-500 list. We analyze the logs collected from this machine over a period of 84 days starting from August 26, 2004. We perform a three-step filtering algorithm on these logs: extracting and categorizing failure events; temporal filtering to remove duplicate reports from the same location; and finally coalescing failure reports of the same error across different locations. Using this approach, we can substantially compress these logs, removing over 99.96% of the 828,387 original entries, and more accurately portray the failure occurrences on this system. © 2005 IEEE.
Erich P. Stuntebeck, John S. Davis II, et al.
HotMobile 2008
Pradip Bose
VTS 1998
Raymond Wu, Jie Lu
ITA Conference 2007
Ehud Altman, Kenneth R. Brown, et al.
PRX Quantum