How to measure and improve system availability

· Category: System Design

Short answer

Availability is the proportion of time a system is operational and reachable, usually expressed as a percentage. A 99.9 percent target ("three nines") allows roughly 8.76 hours of downtime per year.
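
The arithmetic behind an availability percentage can be sketched as follows. This is a minimal illustration, not a monitoring tool: the function names and the 30-day window are assumptions chosen for the example.

```python
def availability_percent(uptime_seconds, downtime_seconds):
    """Availability as a percentage of the total observed time."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("observation window must be positive")
    return 100.0 * uptime_seconds / total


def allowed_downtime_minutes(target_percent, window_days):
    """Downtime that a target permits over a window of window_days days."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - target_percent / 100.0)


# Compare how much downtime common targets leave over an assumed 30-day window.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target, 30)
    print(f"{target}% over 30 days allows {minutes:.1f} minutes of downtime")
```

Each extra nine cuts the permitted downtime by a factor of ten, which is why targets should follow from business impact rather than habit.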

Steps

  1. Define availability targets as service level objectives, and commit to them in service level agreements, based on business impact.
  2. Measure uptime with synthetic probes and real user monitoring.
  3. Eliminate single points of failure by deploying redundant components across zones.
  4. Implement automated failover with health checks and load balancer integration.
  5. Practice disaster recovery drills to validate recovery time objectives.
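
To see why step 3 matters, it can help to estimate how redundancy and dependencies combine. The sketch below uses the standard parallel and serial availability formulas. It assumes failures are independent, which real zones and shared dependencies only approximate, and the function names are illustrative.

```python
import math


def parallel_availability(replica_availability, replicas):
    """Availability of redundant replicas where any single one can serve traffic.

    Assumes replicas fail independently of each other.
    """
    return 1 - (1 - replica_availability) ** replicas


def serial_availability(dependencies):
    """Availability of a request path that needs every dependency to be up."""
    return math.prod(dependencies)


# Two replicas at 99 percent each, deployed in separate zones.
print(parallel_availability(0.99, 2))

# A request that must pass through three components at 99.9 percent each.
print(serial_availability([0.999, 0.999, 0.999]))
```

Under these assumptions, two 99 percent replicas reach about 99.99 percent together, while chaining three 99.9 percent components drops the path to about 99.7 percent. Correlated failures, such as a shared load balancer or a bad deploy, make real numbers worse than the formula suggests.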

Tips

  • Calculate error budgets to balance velocity against reliability.
  • Use canary deployments to detect issues before they affect all users.
  • Maintain runbooks and on-call rotations for rapid incident response.
  • Architect for graceful degradation when full availability is impossible.
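
An error budget, as mentioned in the first tip, is the amount of failure a target leaves room for. The following is a minimal sketch using request counts. The function name, the sample numbers, and the idea of pausing risky releases once the budget is spent are assumptions for illustration, not a prescribed policy.

```python
def error_budget(slo_percent, total_requests, failed_requests):
    """Return (allowed_failures, remaining_failures, fraction_consumed)."""
    allowed = total_requests * (1 - slo_percent / 100.0)
    remaining = allowed - failed_requests
    # A 100 percent target leaves no budget, so any failure overspends it.
    if allowed > 0:
        consumed = failed_requests / allowed
    else:
        consumed = float("inf") if failed_requests else 0.0
    return allowed, remaining, consumed


# Hypothetical month: 1,000,000 requests, 250 failures, 99.9 percent target.
allowed, remaining, consumed = error_budget(99.9, 1_000_000, 250)
print(f"allowed={allowed:.0f} remaining={remaining:.0f} consumed={consumed:.0%}")

# One possible policy: keep shipping while budget remains.
can_ship = consumed < 1.0
```

When the consumed fraction approaches 1, a team might slow feature releases and spend the time on reliability work instead.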

Common issues

  • Over-optimizing for availability at unsustainable cost.
  • Leaving planned maintenance windows out of availability calculations, which overstates the availability users actually experience.
  • Cascading failures where redundancy mechanisms themselves fail.
  • Misconfigured health checks causing premature failovers.
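
One common guard against premature failover is to change a target's state only after several consecutive check results agree. The sketch below illustrates that idea. The class name, the threshold names, and the default values are hypothetical and not taken from any particular load balancer.

```python
class HealthTracker:
    """Flips a target's health state only after consecutive contrary results."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed):
        """Record one health check result and return the current state."""
        if check_passed == self.healthy:
            # The result agrees with the current state, so reset the streak.
            self._streak = 0
        else:
            self._streak += 1
            needed = (self.unhealthy_threshold if self.healthy
                      else self.healthy_threshold)
            if self._streak >= needed:
                self.healthy = check_passed
                self._streak = 0
        return self.healthy


tracker = HealthTracker()
results = [True, False, True, False, False, False, True, True]
states = [tracker.record(r) for r in results]
print(states)
```

With these defaults, a single failed probe does not trigger failover: the target is marked unhealthy only after three failures in a row, and it returns to service after two consecutive successes. Thresholds that are too low cause flapping, while thresholds that are too high delay failover, so they are worth tuning against the probe interval.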

Example

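# Consistent hashing for service discovery
import bisect
import hashlib

def _hash(value):
    # MD5 is used only to spread values evenly, not for security.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def get_node(key, ring):
    # ring is a sorted list of (hash, node); take the first point at or after the key's hash, wrapping around.
    idx = bisect.bisect(ring, (_hash(key),)) % len(ring)
    return ring[idx][1]

# Place 100 virtual points per node on the ring so keys spread evenly.
ring = sorted((_hash(f'{n}#{i}'), n) for n in ['node-a', 'node-b', 'node-c'] for i in range(100))
node = get_node('user-123', ring)

This snippet implements consistent hashing with a sorted hash ring: each node owns many virtual points, and a key maps to the first point at or after its own hash. When a node is added or removed, only the keys next to that node's points move. A plain hash modulo the node count would instead remap most keys, which is why consistent hashing is a foundational technique in scalable distributed systems.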