Data Center Feng Shui: Fault Tolerance and Fault Isolation

Like most architectural decisions, the two goals are not mutually exclusive.

The difference between fault isolation and fault tolerance is not necessarily intuitive. The differences, though subtle, are profound and have a substantial impact on data center architecture.

Fault tolerance is an attribute of systems and architectures that allows them to continue performing their tasks in the event of a component failure. Fault tolerance of servers, for example, is achieved through redundancy in power supplies, hard drives, and network cards. In an architecture, fault tolerance is also achieved through redundancy, by deploying two of everything: two servers, two load balancers, two switches, two firewalls, two Internet connections. A fault-tolerant architecture includes no single point of failure, that is, no component whose failure can cause a disruption in service. Load balancing, for example, is a fault-tolerance strategy that leverages multiple application instances to ensure that the failure of one instance does not impact the availability of the application.
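The load-balancing idea can be illustrated with a minimal Python sketch. It is not how any particular load balancer is implemented; the names `dispatch`, `broken_instance`, and `healthy_instance` are hypothetical, and each application instance is modeled as a plain function. The point is only that a request succeeds as long as at least one redundant instance is healthy.

```python
from typing import Callable, List, Sequence


class AllInstancesFailed(Exception):
    """Raised when no redundant instance could serve the request."""


def dispatch(request: str, instances: Sequence[Callable[[str], str]]) -> str:
    """Try each redundant instance in order; the first healthy one answers."""
    errors: List[Exception] = []
    for instance in instances:
        try:
            return instance(request)
        except Exception as exc:
            # One failed instance must not disrupt the service: try the next.
            errors.append(exc)
    raise AllInstancesFailed(errors)


# Hypothetical instances, used only for illustration.
def broken_instance(request: str) -> str:
    raise ConnectionError("instance A is down")


def healthy_instance(request: str) -> str:
    return f"served: {request}"


if __name__ == "__main__":
    print(dispatch("GET /", [broken_instance, healthy_instance]))
```

With only `broken_instance` in the list, `dispatch` raises `AllInstancesFailed`, which is the single-point-of-failure case that redundancy is meant to remove.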

Fault isolation, on the other hand, is an attribute of systems and architectures that contains the impact of a failure so that only a single system, application, or component is affected. Fault isolation accepts that a component may fail, as long as the failure does not impact the overall system. That sounds like a paradox, but it’s not. Many intermediary devices employ a “fail open” strategy as a method of fault isolation. When a network device must intercept data in order to perform its task – a common web application firewall configuration – it becomes a single point of failure in the data path. To mitigate a potential failure, if something causes the device to crash, it “fails open” and acts like a simple network bridge, forwarding packets on to the next device in the chain without performing any processing. If the same component were deployed in a fault-tolerant architecture, two devices would be deployed instead, ideally leveraging non-network-based failover mechanisms.
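The "fail open" behavior can be sketched in a few lines of Python. This is a simplified illustration, not a description of any real device: the wrapper `fail_open` and the inspectors `crashing_inspector` and `strip_marker` are hypothetical names, and a "packet" is just a byte string. If inspection fails, the packet is forwarded untouched, as a bridge would do.

```python
from typing import Callable


def fail_open(inspect: Callable[[bytes], bytes]) -> Callable[[bytes], bytes]:
    """Wrap an inline inspector so that its failure does not block traffic."""

    def forward(packet: bytes) -> bytes:
        try:
            return inspect(packet)
        except Exception:
            # Fail open: behave like a plain bridge and pass the packet
            # through unmodified, without performing any processing.
            return packet

    return forward


# Hypothetical inspectors, used only for illustration.
def crashing_inspector(packet: bytes) -> bytes:
    raise RuntimeError("inspection engine crashed")


def strip_marker(packet: bytes) -> bytes:
    return packet.replace(b"<bad>", b"")


if __name__ == "__main__":
    print(fail_open(strip_marker)(b"a<bad>b"))        # inspected normally
    print(fail_open(crashing_inspector)(b"payload"))  # forwarded as-is
```

The trade-off is visible in the sketch: traffic keeps flowing when the inspector fails, but it flows uninspected, which is why a fail-open policy isolates the fault rather than tolerating it.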

Similarly, application infrastructure components are often isolated through a contained deployment model (like sandboxes) that prevents a failure – whether an outright crash or a sudden massive consumption of resources – from impacting other applications. Fault isolation is of increasing interest in cloud computing environments, as part of a strategy to minimize the perceived negative impact of shared network, application delivery network, and server infrastructure.
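As a rough illustration of contained deployment, the Python sketch below runs a workload in a separate operating system process with a time limit, so that a crash or a runaway loop in the workload does not take down the caller. It is only an approximation of a sandbox: the helper name `run_isolated` is hypothetical, and a real sandbox would also limit memory, file, and network access.

```python
import subprocess
import sys
from typing import Tuple


def run_isolated(code: str, timeout_s: float = 5.0) -> Tuple[bool, str]:
    """Run `code` in a separate interpreter process; return (ok, output)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # bounds runaway CPU time in the child
        )
    except subprocess.TimeoutExpired:
        # The child was killed; the parent keeps running.
        return False, "timed out"
    return result.returncode == 0, result.stdout.strip()


if __name__ == "__main__":
    print(run_isolated("print('ok')"))
    print(run_isolated("import sys; sys.exit(3)"))          # crash is contained
    print(run_isolated("while True: pass", timeout_s=0.5))  # runaway is contained
```

In each failing case the caller receives a result and continues, which is the property fault isolation is after: the failure is confined to the component that failed.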


Published Jun 16, 2010
Version 1.0

2 Comments

  • I have a VS with no HTTP profile attached, and one pool with two pool members, A and B. A has priority 1 and becomes the active member; B has priority 0 and becomes the passive member, so all traffic goes to member A. An ICMP health monitor is configured with an interval of 2 seconds and a timeout of 7 seconds. Suppose member A stops responding. F5 will mark member A down only after 7 seconds, and only then will passive member B come up and start responding to traffic. During those 7 seconds all new requests are lost, because the VS is still sending traffic to A, which is not responding, while B is passive and receives no traffic. What I want is for F5 to queue new connections for those 7 seconds and send them to member B once the VS marks member A down and starts sending traffic to member B.