TY - JOUR
T1 - netCSI: A Generic Fault Diagnosis Algorithm for Large-Scale Failures in Computer Networks
AU - Tati, Srikar
AU - Rager, Scott
AU - Ko, Bong Jun
AU - Cao, Guohong
AU - Swami, Ananthram
AU - La Porta, Thomas
N1 - Publisher Copyright: © 2004-2012 IEEE.
PY - 2016/5/1
Y1 - 2016/5/1
AB - We present a framework and a set of algorithms for determining faults in networks when large-scale outages occur. The design principles of our algorithm, netCSI, are motivated by the fact that failures are geographically clustered in such cases. We address the challenge of determining faults with incomplete symptom information due to a limited number of reporting nodes. netCSI consists of two parts: a hypothesis generation algorithm and a ranking algorithm. When constructing the hypothesis list of potential causes, we make novel use of positive and negative symptoms to improve the precision of the results. In addition, we propose pruning and thresholding, along with a dynamic threshold value selector, to reduce the complexity of our algorithm. The ranking algorithm is based on conditional failure probability models that account for the geographic correlation of the network objects in clustered failures. We evaluate the performance of netCSI for networks with both random and realistic topologies. We compare the performance of netCSI with an existing fault diagnosis algorithm, MAX-COVERAGE, and demonstrate an average gain of 128 percent in accuracy for realistic topologies.
UR - http://www.scopus.com/inward/record.url?scp=84969930799&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84969930799&partnerID=8YFLogxK
U2 - 10.1109/TDSC.2014.2369051
DO - 10.1109/TDSC.2014.2369051
M3 - Article
AN - SCOPUS:84969930799
SN - 1545-5971
VL - 13
SP - 355
EP - 368
JO - IEEE Transactions on Dependable and Secure Computing
JF - IEEE Transactions on Dependable and Secure Computing
IS - 3
M1 - 6951396
ER -