A distributed system may be capable of handling millions of requests, but that capability is useless if the entire platform stops working when a single server fails. Managing high traffic and ensuring fault tolerance are two separate challenges. The concept that addresses fault tolerance is known as Availability.
Availability is the proportion of time a system remains operational, accessible, and responsive to users. A highly available system is designed to continue functioning even when some of its components fail. Instead of allowing one failure to bring down the entire platform, the system isolates the problem and keeps serving user requests.
1. Understanding Availability
Availability and reliability are often used interchangeably, but they have different meanings in system design.
Availability measures uptime. It answers the question: Can users access the system right now?
Reliability measures correctness. It answers the question: Is the system consistently producing accurate results?
A system can be highly available if it remains online and responds to requests, yet still be unreliable if it frequently returns errors or incorrect data. Similarly, a reliable system may provide correct results whenever it runs, but if it experiences frequent downtime, its availability will be poor.
2. Calculating Availability and the "Nines"
Availability is calculated using the following formula:
Availability = Uptime / (Uptime + Downtime)
For example, if a system remains operational for 364 days in a year and experiences one day of downtime:
Availability = 364 / 365 = 99.73%
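The formula translates directly into a few lines of Python. This is a minimal sketch; the function name `availability` is chosen here for illustration and is not part of any library.

```python
def availability(uptime: float, downtime: float) -> float:
    """Return availability as a fraction in [0, 1].

    Both arguments must use the same time unit (days, hours, seconds, ...).
    """
    total = uptime + downtime
    if total <= 0:
        raise ValueError("uptime + downtime must be positive")
    return uptime / total


# 364 days up and 1 day down, as in the example above.
print(f"{availability(364, 1):.2%}")  # prints 99.73%
```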
To categorize uptime targets, engineers speak of the "Nines" of availability. Each additional nine cuts the allowable downtime by a factor of ten.
| Availability | Maximum Downtime Per Year | Typical Use Case |
|---|---|---|
| 99% (Two Nines) | 3.65 Days | Internal or non-critical applications |
| 99.9% (Three Nines) | 8.76 Hours | Consumer-facing web applications |
| 99.99% (Four Nines) | 52.6 Minutes | Enterprise-grade services |
| 99.999% (Five Nines) | 5.26 Minutes | Mission-critical systems such as banking and healthcare |
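Achieving higher availability becomes increasingly difficult and expensive because the downtime budget shrinks so quickly. At five nines, roughly five minutes per year must absorb every outage, which leaves almost no room for manual recovery.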
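The downtime figures in the table can be reproduced with a short calculation. The sketch below assumes a 365-day year, which matches the 3.65-day figure for 99%; the function name `max_downtime_per_year` is illustrative.

```python
from datetime import timedelta

DAYS_PER_YEAR = 365  # assumption: non-leap year


def max_downtime_per_year(availability_pct: float) -> timedelta:
    """Maximum yearly downtime permitted by an availability target given in percent."""
    unavailable_fraction = 1 - availability_pct / 100
    return timedelta(days=DAYS_PER_YEAR * unavailable_fraction)


for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = max_downtime_per_year(pct).total_seconds() / 60
    print(f"{pct}% -> {minutes:.2f} minutes per year")
```

Using 365.25 days instead shifts the results slightly, which is why published tables sometimes differ in the last digit.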
3. Improving Availability Through Redundancy
When a request depends on several services in sequence, overall availability is lower than that of any single service, because all of them must be up at the same time.
For example:
Service A Availability = 99.9%
Service B Availability = 99.9%
If both services are required to complete a request, and they fail independently, the combined availability becomes:
99.9% × 99.9% = 99.8001% ≈ 99.8%
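The same multiplication generalizes to any number of services in a chain. This is a minimal sketch that assumes the services fail independently; `serial_availability` is a name chosen for illustration.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes the components fail independently of one another.
    """
    return prod(availabilities)


print(f"{serial_availability([0.999, 0.999]):.4%}")  # prints 99.8001%
```

Each added dependency multiplies in another factor below 1, so long call chains erode availability even when every individual service looks healthy.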
To improve availability, system architects introduce redundancy by running multiple instances of the same service.
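Imagine two identical application servers running behind a load balancer, each with 99.9% availability. Assuming either server can carry the full load on its own, and setting aside the load balancer itself, the system will only fail if both servers fail simultaneously.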
Since the failure probability of each server is:
0.1% = 0.001
The probability that both fail together is:
0.001 × 0.001 = 0.000001 (0.0001%)
Therefore:
System Availability = 100% − 0.0001% = 99.9999%
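This simple redundancy dramatically improves availability, moving from three nines to six nines. The result holds only if the two servers fail independently; a shared power supply, network link, or software bug would make real-world availability lower.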
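The redundancy calculation above can be sketched for any number of replicas. The code assumes independent failures and that a single surviving replica can serve all traffic; `parallel_availability` is an illustrative name.

```python
from math import prod


def parallel_availability(availabilities):
    """Availability of redundant replicas where any one being up is enough.

    Assumes the replicas fail independently of one another.
    """
    probability_all_down = prod(1 - a for a in availabilities)
    return 1 - probability_all_down


print(f"{parallel_availability([0.999, 0.999]):.4%}")  # prints 99.9999%
```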
4. Common Causes of System Failure
To build highly available systems, engineers must understand the major causes of outages.
Hardware Failures
Physical components naturally wear out over time. Hard drives, memory modules, power supplies, and networking equipment can all fail unexpectedly.
Software Bugs and Resource Exhaustion
Software defects may cause crashes or performance degradation. Common examples include memory leaks, deadlocks, excessive CPU consumption, and unhandled exceptions.
Network Issues
Networks are unpredictable and can experience packet loss, high latency, DNS failures, or network partitions that isolate parts of a distributed system.
Cascading Failures
A failure in one component can trigger failures elsewhere. For example, if one server crashes, its traffic may be redirected to another server, causing overload and potentially leading to a chain reaction that affects the entire system.
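A toy simulation makes the chain reaction concrete. The model below is a deliberate simplification: load is split evenly across live servers, and any server whose share exceeds its capacity fails. The function name and the numbers are hypothetical.

```python
def simulate_cascade(total_load, capacities):
    """Return the indices of servers still alive once overload failures settle.

    Load is split evenly across live servers. Any server whose share exceeds
    its capacity fails, and the load is then redistributed among the survivors.
    """
    alive = list(range(len(capacities)))
    while alive:
        share = total_load / len(alive)
        overloaded = [i for i in alive if share > capacities[i]]
        if not overloaded:
            break
        alive = [i for i in alive if i not in overloaded]
    return alive


# Four servers, each able to handle 30 units, sharing 100 units: 25 each, stable.
print(simulate_cascade(100, [30, 30, 30, 30]))  # prints [0, 1, 2, 3]
# One server lost: the remaining three get about 33.3 each, exceed capacity, and all fail.
print(simulate_cascade(100, [30, 30, 30]))  # prints []
```

The second call shows why spare capacity matters: losing one of four servers takes down the whole pool when the survivors have no headroom.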
5. Redundancy and Standby Models
Organizations use different standby strategies depending on cost, recovery requirements, and business needs.
A. Cold Standby
In a cold standby setup, the backup server remains powered off until required.
Failover Process:
Start the backup server.
Deploy the application.
Apply configurations.
Restore or attach data.
Recovery Time: Several minutes to several hours.
Advantages:
Lowest infrastructure cost.
Suitable for disaster recovery.
Disadvantages:
Slow recovery.
Not ideal for production systems requiring continuous availability.
B. Warm Standby
In a warm standby setup, the backup server remains powered on and configured but does not actively serve user requests.
Failover Process:
Traffic is redirected to the standby server when the primary server fails.
Recovery Time: A few seconds to a few minutes.
Advantages:
Faster recovery than cold standby.
Moderate operational cost.
Disadvantages:
Some downtime may still occur during failover.
C. Hot Standby (Active-Active)
In a hot standby architecture of the active-active kind, multiple servers actively process requests at the same time, so there is no idle backup that must be brought online.
Failover Process:
Failed nodes are automatically removed from the load balancer.
Remaining nodes continue serving traffic without interruption.
For databases, this often requires synchronous replication, where data is written to multiple nodes before confirming success.
Recovery Time: Milliseconds to a few seconds.
Advantages:
Near-zero downtime.
Highest availability.
Disadvantages:
Highest infrastructure and maintenance costs.
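The failover behaviour described for this model, where failed nodes are dropped from rotation while the rest keep serving, can be illustrated with a toy round-robin balancer. This is a sketch under simplifying assumptions: the `LoadBalancer` class and node names are hypothetical, and health is set by hand rather than by real health checks.

```python
from itertools import cycle


class LoadBalancer:
    """Toy round-robin balancer that skips nodes marked unhealthy."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._rotation = cycle(self.nodes)

    def mark_down(self, node):
        self.healthy.discard(node)

    def mark_up(self, node):
        if node in self.nodes:
            self.healthy.add(node)

    def next_node(self):
        if not self.healthy:
            raise RuntimeError("no healthy nodes available")
        # At most len(nodes) steps are needed to reach a healthy node.
        for _ in range(len(self.nodes)):
            node = next(self._rotation)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


lb = LoadBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")  # e.g. a health check failed
print([lb.next_node() for _ in range(4)])  # prints ['app-1', 'app-3', 'app-1', 'app-3']
```

Because the surviving nodes were already serving traffic, removing a failed node needs no startup or deployment step, which is where the short recovery time comes from.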
6. Geographic Redundancy: Availability Zones and Regions
The physical location of infrastructure plays an important role in availability planning.
Availability Zones (AZs)
Availability Zones are physically separate data centers, or groups of data centers, within the same geographic region. Each zone has independent power, cooling, and networking infrastructure.
Because they are located close together, data can be replicated quickly with minimal latency.
Multi-Region Deployments
Multi-region deployments place infrastructure across different geographic regions, often separated by hundreds or thousands of miles.
This approach protects against large-scale disasters and regional outages. However, the distance between regions introduces network delays, making synchronous replication difficult. As a result, asynchronous replication and eventual consistency are commonly used.
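The difference between the two replication modes can be shown with an in-memory toy model. It is only a sketch: the `Primary` and `Replica` classes are hypothetical, there is no real network, and `flush` stands in for the background process that ships changes to a remote region.

```python
class Replica:
    def __init__(self):
        self.data = {}


class Primary:
    """Toy primary that replicates writes either synchronously or asynchronously."""

    def __init__(self, replica, synchronous):
        self.data = {}
        self.replica = replica
        self.synchronous = synchronous
        self.pending = []  # writes not yet shipped to the replica (async mode only)

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Acknowledge only after the replica has the write; in a real
            # deployment this step would pay the cross-region round trip.
            self.replica.data[key] = value
        else:
            # Acknowledge immediately; the replica catches up later.
            self.pending.append((key, value))

    def flush(self):
        """Ship queued writes to the replica, as a background process would."""
        for key, value in self.pending:
            self.replica.data[key] = value
        self.pending.clear()


replica = Replica()
primary = Primary(replica, synchronous=False)
primary.write("balance", 100)
print(replica.data.get("balance"))  # prints None: the replica is briefly stale
primary.flush()
print(replica.data.get("balance"))  # prints 100: the replica has converged
```

The window between `write` and `flush` is the eventual-consistency gap: a read served from the remote region during that window returns stale data, and writes still queued when the primary fails may be lost.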
Summary
Availability measures the proportion of time a system remains accessible and operational.
Reliability focuses on the correctness of system behavior.
Redundancy increases availability by removing single points of failure.
System failures can result from hardware problems, software issues, network disruptions, or cascading failures.
Cold, Warm, and Hot Standby models offer different trade-offs between cost and recovery speed.
Availability Zones and Multi-Region deployments provide protection against infrastructure and regional failures.
High availability is a fundamental requirement in modern distributed systems because it ensures continuous service even when individual components fail.