How service discovery works in distributed systems

· Category: System Design

Short answer

Service discovery enables clients to locate healthy instances of services without hardcoding addresses, supporting elastic scaling and failover.

Steps

  1. Services register themselves with a registry on startup with metadata and health status.
  2. Clients query the registry to obtain a list of available instances.
  3. The registry monitors health, through heartbeats or active health checks, and removes instances that fail.
  4. Clients cache instance lists and refresh them periodically.
  5. On shutdown, services deregister themselves cleanly.
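
The steps above can be sketched as a toy in-memory registry. This is only an illustration under stated assumptions: the Registry class, its method names, and the TTL-based heartbeat expiry are hypothetical and do not correspond to any particular product's API. Real registries are networked, replicated services.

```python
import time


class Registry:
    """Toy in-memory registry (hypothetical, for illustration only)."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._instances = {}  # (service, address) -> (metadata, last_heartbeat)

    def register(self, service, address, metadata=None):
        # Step 1: an instance announces itself on startup.
        self._instances[(service, address)] = (metadata or {}, self.clock())

    def heartbeat(self, service, address):
        # Instances renew their registration to signal they are still healthy.
        key = (service, address)
        if key not in self._instances:
            return False
        metadata, _ = self._instances[key]
        self._instances[key] = (metadata, self.clock())
        return True

    def deregister(self, service, address):
        # Step 5: clean removal on shutdown.
        self._instances.pop((service, address), None)

    def lookup(self, service):
        # Steps 2 and 3: drop instances whose heartbeat expired, return the rest.
        now = self.clock()
        expired = [key for key, (_, seen) in self._instances.items()
                   if now - seen > self.ttl_seconds]
        for key in expired:
            del self._instances[key]
        return sorted(addr for (svc, addr) in self._instances if svc == service)


registry = Registry(ttl_seconds=10.0)
registry.register("orders", "10.0.0.1:8080", {"version": "1.2"})
registry.register("orders", "10.0.0.2:8080")
registry.deregister("orders", "10.0.0.2:8080")
print(registry.lookup("orders"))
```

Expiring entries on a missed heartbeat means a crashed instance that never deregisters still disappears from lookups after roughly one TTL.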

Tips

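  • Use client-side discovery to avoid an extra network hop, or server-side discovery (a load balancer or router queries the registry) to keep clients simple.
  • Combine DNS-based discovery for simplicity with API-based registries when clients need richer detail such as health status, versions, or tags.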
  • Secure the registry to prevent malicious registration of fake instances.
  • Use sidecar proxies like Envoy to handle discovery transparently.
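
As a rough sketch of the client-side option, the caller can fetch the instance list and choose a target itself. The RoundRobinClient class and its fetch_instances callable are hypothetical names used only for illustration. In server-side discovery the same selection logic would live in the load balancer instead.

```python
import itertools


class RoundRobinClient:
    """Client-side discovery: the caller picks an instance itself."""

    def __init__(self, fetch_instances):
        # fetch_instances is a hypothetical callable that returns the current
        # list of healthy addresses, for example by querying a registry.
        self._fetch_instances = fetch_instances
        self._counter = itertools.count()

    def pick(self):
        instances = self._fetch_instances()
        if not instances:
            raise LookupError("no healthy instances available")
        # Rotate through the list so load is spread evenly across instances.
        return instances[next(self._counter) % len(instances)]


client = RoundRobinClient(lambda: ["10.0.0.1:8080", "10.0.0.2:8080"])
first, second, third = client.pick(), client.pick(), client.pick()
```

Here first and third refer to the same instance, because the rotation wraps around after two picks.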

Common issues

  • Stale cache entries pointing to dead instances.
  • Registry outages preventing all service communication.
  • Split-brain in clustered registries.
  • Thundering herd on registry refresh after a restart.
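
Some of these issues can be softened on the client side. The sketch below, with the hypothetical CachedLookup class and fetch callable as assumptions, bounds staleness with a TTL, adds random jitter so clients do not all refresh at the same moment, and falls back to the last known list when the registry is unreachable. It is one possible approach, not a complete remedy.

```python
import random
import time


class CachedLookup:
    """Caches an instance list with a jittered TTL (hypothetical sketch)."""

    def __init__(self, fetch, ttl_seconds=30.0, jitter=0.2,
                 clock=time.monotonic, rng=random.random):
        self._fetch = fetch  # hypothetical callable that queries the registry
        self._ttl = ttl_seconds
        self._jitter = jitter
        self._clock = clock
        self._rng = rng
        self._cached = []
        self._expires_at = None  # None means nothing has been cached yet

    def _next_expiry(self):
        # Spread refreshes over ttl * (1 - jitter) .. ttl * (1 + jitter) so
        # clients that started together do not hit the registry in lockstep.
        factor = 1.0 + self._jitter * (2.0 * self._rng() - 1.0)
        return self._clock() + self._ttl * factor

    def instances(self):
        if self._expires_at is not None and self._clock() < self._expires_at:
            return list(self._cached)
        try:
            self._cached = list(self._fetch())
        except OSError:
            # Registry unreachable: keep serving the last known list if there
            # is one, otherwise surface the error to the caller.
            if self._expires_at is None:
                raise
        self._expires_at = self._next_expiry()
        return list(self._cached)
```

Serving a stale list during an outage trades correctness for availability, so callers should still handle connection failures to individual instances.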

Example

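# Consistent hashing: map a key to a node on a hash ring
import bisect
import hashlib

def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def get_node(key, nodes):
    # Place each node on the ring at its hash position.
    ring = sorted((_hash(n), n) for n in nodes)
    # Pick the first node at or after the key's position, wrapping around.
    idx = bisect.bisect_left(ring, (_hash(key), ''))
    return ring[idx % len(ring)][1]

node = get_node('user-123', ['node-a', 'node-b', 'node-c'])

This snippet places nodes on a hash ring and assigns each key to the first node at or after the key's hash position. Unlike a plain hash modulo the node count, which remaps most keys whenever the node list changes, a ring moves only the keys adjacent to the node that was added or removed. Production implementations usually add virtual nodes to even out the distribution.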
