How to measure and improve system availability
Category: System Design
Short answer
Availability is the proportion of time a system is operational and reachable: uptime divided by total time in the measurement window. It is typically expressed as a percentage, such as 99.9 percent ("three nines").
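To make the percentage concrete, here is a minimal sketch that converts an availability target into a downtime allowance and computes availability from measured uptime. The function names and the 365-day year are assumptions made for illustration.

```python
# Hypothetical helpers; names and the 365-day year are illustrative assumptions.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime permitted over the period."""
    return period_hours * (1 - availability_percent / 100)


def measured_availability(uptime_seconds, total_seconds):
    """Return availability as a percentage of the observation window."""
    return 100 * uptime_seconds / total_seconds
```

With these assumptions, a 99.9 percent target allows about 8.76 hours of downtime per year, and 99.99 percent allows about 53 minutes.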
Steps
- Define availability targets as service level objectives (SLOs), and commit to them in service level agreements (SLAs), based on business impact.
- Measure uptime with synthetic probes and real user monitoring.
- Eliminate single points of failure by deploying redundant components across zones.
- Implement automated failover with health checks and load balancer integration.
- Practice disaster recovery drills to validate recovery time objectives.
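The effect of redundancy on availability can be estimated with simple probability. The sketch below assumes component failures are independent, which real deployments only approximate (shared zones, networks, or deploys correlate failures). The function names are hypothetical.

```python
# Illustrative sketch; assumes failures are statistically independent.

def serial_availability(availabilities):
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel_availability(availability, replicas):
    """Availability when at least one of N identical replicas must be up."""
    return 1 - (1 - availability) ** replicas
```

Under that assumption, two replicas at 99 percent each yield roughly 99.99 percent combined, while two 99.9 percent components in series yield roughly 99.8 percent. This is why each dependency in the request path lowers the ceiling on end-to-end availability.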
Tips
- Calculate error budgets to balance velocity against reliability.
- Use canary deployments to detect issues before they affect all users.
- Maintain runbooks and on-call rotations for rapid incident response.
- Architect for graceful degradation when full availability is impossible.
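An error budget is the amount of unreliability a target permits. A minimal request-based sketch follows; the function names are hypothetical, and a time-based budget would work the same way with minutes in place of requests.

```python
# Illustrative sketch; names are assumptions, not a standard API.

def error_budget(slo_percent, total_requests):
    """Number of requests allowed to fail in the window under the SLO."""
    return total_requests * (1 - slo_percent / 100)


def budget_remaining(slo_percent, total_requests, failed_requests):
    """Fraction of the budget still unspent (negative when overspent)."""
    budget = error_budget(slo_percent, total_requests)
    if budget == 0:
        raise ValueError("a 100 percent SLO leaves no error budget")
    return (budget - failed_requests) / budget
```

For example, a 99.9 percent target over one million requests allows about 1,000 failures. After 250 failures, roughly 75 percent of the budget remains, which a team might treat as room to keep shipping changes.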
Common issues
- Over-optimizing for availability at unsustainable cost.
- Ignoring planned maintenance windows in availability calculations.
- Cascading failures where redundancy mechanisms themselves fail.
- Misconfigured health checks causing premature failovers.
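One common guard against premature failover is to require several consecutive probe failures before marking a target unhealthy, and several consecutive successes before restoring it. The sketch below illustrates that idea; the class name and default thresholds are assumptions, not values taken from any particular load balancer.

```python
# Illustrative sketch; class name and thresholds are hypothetical.

class HealthTracker:
    """Flips health state only after enough consecutive contrary probes."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive probes contradicting the current state

    def record(self, probe_ok):
        """Record one probe result and return the current health state."""
        if probe_ok == self.healthy:
            self._streak = 0  # probe agrees with current state; reset
        else:
            self._streak += 1
            needed = self.healthy_threshold if probe_ok else self.unhealthy_threshold
            if self._streak >= needed:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy
```

With these defaults, a single dropped probe does not trigger failover. Three failures in a row do, and two successes in a row bring the target back.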
Example
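# Simple modulo hashing to map a key to a node
import hashlib

def get_node(key, nodes):
    # MD5 is used only to spread keys evenly, not for security
    hash_val = int(hashlib.md5(key.encode()).hexdigest(), 16)
    # Note: changing len(nodes) remaps most keys
    return nodes[hash_val % len(nodes)]

node = get_node('user-123', ['node-a', 'node-b', 'node-c'])
This snippet distributes keys across nodes with simple modulo hashing. It is not consistent hashing: when a node is added or removed, len(nodes) changes and most keys map to a different node. Consistent hashing places nodes on a hash ring so that only a fraction of keys move, which matters for availability when nodes fail or are replaced.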
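For comparison, here is a minimal sketch of a hash ring, the structure usually meant by consistent hashing. Unlike `hash % len(nodes)`, adding a node to the ring moves only the keys that now fall to the new node. The `HashRing` class, the `vnodes` parameter, and the `node#i` labelling of virtual nodes are assumptions made for this illustration.

```python
import bisect
import hashlib


def _hash(value):
    # MD5 is used here only to spread keys evenly, not for security.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    """Hypothetical minimal hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` points on the ring to smooth the distribution.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def get_node(self, key):
        # First ring point at or after the key's hash, wrapping to the start.
        idx = bisect.bisect_left(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(['node-a', 'node-b', 'node-c'])
node = ring.get_node('user-123')
```

Rebuilding the ring with a fourth node leaves every key either on its previous node or on the new one, so caches and sessions on the surviving nodes stay mostly valid.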