TY - GEN
T1 - Rain or Shine? - Making Sense of Cloudy Reliability Data
AU - Narayanan, Iyswarya
AU - Sharma, Bikash
AU - Wang, Di
AU - Govindan, Sriram
AU - Caulfield, Laura
AU - Sivasubramaniam, Anand
AU - Kansal, Aman
AU - Liu, Jie
AU - Khessib, Badriddine
AU - Vaid, Kushagra
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/7/13
Y1 - 2017/7/13
N2 - Cloud datacenters must ensure high availability for hosted applications, and failures can be the bane of datacenter operators. Understanding the what, when, and why of failures can help tremendously in mitigating their occurrence and impact. Failures can, however, depend on numerous spatial and temporal factors spanning hardware, workloads, support facilities, and even the environment. One has to rely on failure data from the field to quantify the influence of these factors on failures. Towards this goal, we collect failure data, along with many parameters that might influence failures, from two large production datacenters with very diverse characteristics. We show that multiple factors simultaneously affect failures, and that these factors may interact in non-trivial ways. This makes conventional approaches that study aggregate characteristics or single-parameter influences rather inaccurate. Instead, we build a multi-factor analysis framework to systematically identify influencing factors, quantify their relative impact, and support more accurate decision making for failure mitigation. We demonstrate this approach for three important decisions: spare capacity provisioning, comparing the reliability of hardware for vendor selection, and quantifying flexibility in datacenter climate control for cost-reliability trade-offs.
UR - http://www.scopus.com/inward/record.url?scp=85027264757&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85027264757&partnerID=8YFLogxK
U2 - 10.1109/ICDCS.2017.103
DO - 10.1109/ICDCS.2017.103
M3 - Conference contribution
AN - SCOPUS:85027264757
T3 - Proceedings - International Conference on Distributed Computing Systems
SP - 218
EP - 229
BT - Proceedings - IEEE 37th International Conference on Distributed Computing Systems, ICDCS 2017
A2 - Lee, Kisung
A2 - Liu, Ling
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017
Y2 - 5 June 2017 through 8 June 2017
ER -