Article · Aug 15, 2011 · Gold OA

Understanding network failures in data centers

University of Toronto · Microsoft (United States)

Indexed in Crossref

Abstract

We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences with many short-lived software-related faults, (4) failures have potential to cause loss of many small packets such as…

Citation impact

Total citations: 714
FWCI: 95.55
Percentile: 100%
References: 29

Authors

3

Topics & keywords

Keywords
  • Redundancy (engineering)
  • Computer science
  • Data center
  • Computer network
  • Network packet
  • Reliability (semiconductor)
  • Data redundancy
  • Networking hardware