Coordination in Distributed Systems

Modern distributed systems are built from many moving parts - databases, services, queues, caches, and schedulers - all operating concurrently and often out of sync. The only reason this doesn’t collapse into chaos is coordination.

Coordination is how independent components stay aligned - who leads, who follows, who owns which task, and what to do when something goes wrong. It’s the invisible contract that keeps the system from acting against itself.

But coordination takes different shapes depending on where it lives in the system. At the control plane, it appears as consensus, preventing disagreements before they occur. At the application plane, it takes the form of compensating protocols like the Saga pattern, healing disagreements once they’ve already happened.

You might wonder - what about the data plane?
At the data plane, the goal isn’t coordination but correctness, achieved either by agreeing on the order of operations through consensus or by agreeing on the final value through reconciliation.
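
To make "agreement on the final value" concrete, here is a minimal sketch of reconciliation, assuming a grow-only counter where each replica tracks its own increments. The replica names and the `merge` and `value` helpers are illustrative assumptions, not something taken from a particular database.

```python
from typing import Dict

# Each replica keeps a map of replica name -> increments it has seen from that replica.
Counter = Dict[str, int]


def merge(a: Counter, b: Counter) -> Counter:
    """Reconcile two replicas by taking the per-replica maximum."""
    return {name: max(a.get(name, 0), b.get(name, 0)) for name in a.keys() | b.keys()}


def value(counter: Counter) -> int:
    """The counter's value is the sum of all per-replica counts."""
    return sum(counter.values())


# Two replicas that diverged while out of contact (hypothetical data).
replica_a: Counter = {"a": 3, "b": 1}
replica_b: Counter = {"a": 2, "b": 4}

merged_ab = merge(replica_a, replica_b)
merged_ba = merge(replica_b, replica_a)
```

Because the merge is a per-key maximum, it gives the same result in either direction and can be applied repeatedly, so the replicas converge without agreeing on the order in which the increments happened.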

Coordination Through Consensus

In my earlier post, we explored how consensus underpins the control plane, giving a cluster of nodes a shared, ordered view of reality. Without it, even the simplest system would fracture under ambiguity. Nodes wouldn’t know who the leader is, who owns which partition, or what configuration is current.

But the role of consensus in the control plane extends far beyond database replication. It’s what keeps large microservice ecosystems coherent too. Imagine an e-commerce platform with separate Order, Payment, and Inventory services. Each runs independently, yet they must coordinate to behave as one system: only one instance should process payments, only one should manage stock allocation, but all should agree on rollout or configuration changes. Consensus provides the foundation for this coordination. It ensures that leadership, membership, and configuration decisions are made once and in one order, giving the entire system a single, consistent view of control.
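
As a rough illustration of "only one instance should process payments", the sketch below models leadership as a lease with an expiry time. `LeaseStore` is a hypothetical in-memory stand-in for a consensus-backed store. In a real deployment the check-and-write in `try_acquire` would have to be a single atomic operation agreed on through consensus, which this single-process sketch simply assumes. The key and instance names are made up, and time is passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class LeaseStore:
    """Hypothetical in-memory stand-in for a consensus-backed key-value store."""

    def __init__(self) -> None:
        self._leases: Dict[str, Lease] = {}

    def try_acquire(self, key: str, candidate: str, now: float, ttl: float) -> bool:
        """Grant the lease if it is free, expired, or already held by the candidate."""
        current = self._leases.get(key)
        if current is None or current.expires_at <= now or current.holder == candidate:
            self._leases[key] = Lease(candidate, now + ttl)
            return True
        return False

    def leader(self, key: str, now: float) -> Optional[str]:
        """Return the current holder, or None if nobody holds a live lease."""
        current = self._leases.get(key)
        if current is not None and current.expires_at > now:
            return current.holder
        return None


store = LeaseStore()
key = "payments/leader"

first = store.try_acquire(key, "payment-1", now=0.0, ttl=10.0)   # lease is free
second = store.try_acquire(key, "payment-2", now=5.0, ttl=10.0)  # payment-1 still holds it
third = store.try_acquire(key, "payment-2", now=11.0, ttl=10.0)  # payment-1's lease expired
```

The point of the lease is that leadership is decided in one place and in one order: a second instance can take over only after the first one's lease has lapsed, never alongside it.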

Coordination Through Compensation

At the application plane, coordination takes on a different meaning.

Here, services don’t share a log or a leader; they share only intent. Each service owns its own data and operates autonomously, which means a single atomic transaction spanning all of them is no longer practical. Failures are inevitable, so instead of preventing inconsistency, the goal becomes recovering gracefully when it occurs.

This is where compensating protocols come in. Rather than enforcing a single, ordered transaction across services, the system coordinates through local transactions and explicit compensations that restore consistency when steps fail.

Take an example from e-commerce:
1. The Order Service creates an order.
2. The Payment Service charges the customer.
3. The Inventory Service reserves stock.

If the third step fails, for example because stock is unavailable, the system can’t roll back the first two atomically. There’s no shared transaction log spanning all services. Instead, the Order Service cancels the order, and the Payment Service issues a refund. The failure is resolved not by consensus, but by compensation.

This pattern - the Saga - coordinates long-lived, cross-service workflows without global locks or consensus. Each local transaction commits independently, and compensating steps undo side effects when needed.
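
The e-commerce example above can be sketched as a small orchestrated Saga. This is a minimal sketch under stated assumptions: the services are plain in-process functions, `run_saga`, `StockUnavailable`, and the `state` dictionary are hypothetical names, and compensations run in reverse order of completion, which is a common choice rather than the only one.

```python
from typing import Callable, Dict, List, Tuple

# Each step pairs a local transaction with the compensation that undoes it.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


class StockUnavailable(Exception):
    """Hypothetical failure raised by the inventory step."""


def run_saga(steps: List[Step], log: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception as exc:
            log.append(f"failed:{name}:{type(exc).__name__}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return False
        log.append(f"done:{name}")
        completed.append(step)
    return True


# Stand-ins for each service's local state (hypothetical).
state: Dict[str, str] = {"order": "none", "payment": "none"}


def create_order() -> None:
    state["order"] = "created"


def cancel_order() -> None:
    state["order"] = "cancelled"


def charge_customer() -> None:
    state["payment"] = "charged"


def refund_customer() -> None:
    state["payment"] = "refunded"


def reserve_stock() -> None:
    raise StockUnavailable("no stock left")


def release_stock() -> None:
    pass  # never reached in this run: the reservation itself failed


log: List[str] = []
ok = run_saga(
    [
        ("create_order", create_order, cancel_order),
        ("charge_payment", charge_customer, refund_customer),
        ("reserve_stock", reserve_stock, release_stock),
    ],
    log,
)
```

Note that nothing is rolled back in the database sense. The order and the charge were both committed, and the compensations are new transactions that move each service to a consistent end state (cancelled, refunded).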

Over time, this approach has evolved into several flavours:
• Orchestration-based Sagas - a central coordinator drives the workflow and triggers compensations.
• Choreography-based Sagas - each service listens for events and reacts autonomously.
• TCC (Try–Confirm–Cancel) - a structured variant common in booking and finance systems, where each service explicitly reserves, confirms, or cancels resources.
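
To show how TCC differs from a plain Saga, here is a minimal sketch assuming a booking scenario with two participants. `SeatInventory`, `run_tcc`, and the transaction IDs are hypothetical names chosen for illustration. The Try phase only places a hold, and the hold is later settled by either Confirm or Cancel.

```python
from typing import Dict, List, Tuple


class SeatInventory:
    """Hypothetical TCC participant: Try holds seats, Confirm or Cancel settles the hold."""

    def __init__(self, available: int) -> None:
        self.available = available
        self.held: Dict[str, int] = {}  # transaction id -> seats on hold
        self.sold = 0

    def try_reserve(self, tx_id: str, seats: int) -> bool:
        if seats > self.available:
            return False
        self.available -= seats
        self.held[tx_id] = seats
        return True

    def confirm(self, tx_id: str) -> None:
        # pop with a default makes a repeated confirm a harmless no-op
        self.sold += self.held.pop(tx_id, 0)

    def cancel(self, tx_id: str) -> None:
        # pop with a default makes a repeated cancel a harmless no-op
        self.available += self.held.pop(tx_id, 0)


def run_tcc(tx_id: str, requests: List[Tuple[SeatInventory, int]]) -> bool:
    """Try every participant; confirm all if every Try succeeds, otherwise cancel the holds."""
    tried: List[SeatInventory] = []
    for participant, seats in requests:
        if participant.try_reserve(tx_id, seats):
            tried.append(participant)
        else:
            for held_participant in tried:
                held_participant.cancel(tx_id)
            return False
    for participant in tried:
        participant.confirm(tx_id)
    return True


outbound = SeatInventory(available=2)
inbound = SeatInventory(available=1)

too_many = run_tcc("tx-1", [(outbound, 2), (inbound, 2)])  # inbound cannot hold 2 seats
fits = run_tcc("tx-2", [(outbound, 1), (inbound, 1)])
```

Unlike the compensation in a Saga, Cancel here releases a hold that was never final, so there is nothing like a refund to issue. The trade-off is that every participant has to support the extra reserved state.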

At the application plane, coordination is no longer about agreement; it’s about alignment after disagreement. Where consensus avoids conflict through order, compensation restores order after conflict.