Consensus is what turns a collection of replicas into a single, coherent truth. But it’s also what slows that truth down.
The scope of consensus must be chosen carefully, because it comes with a cost: without restraint, no system can remain both available and performant. When correctness depends on order, consensus is essential; when the environment itself guarantees that order, it may not be worth the weight. This boundary, between where consensus is necessary and where it can safely be replaced by controlled synchrony, is what truly defines how modern distributed databases are built, as we will see in the following discussion.
Consensus: The Backbone of Distributed Coordination
Modern systems rarely run on a single machine. They operate as clusters of servers that must work together, sharing responsibilities, recovering from failures, and scaling as one logical whole. But this coordination isn’t possible unless the nodes stay in sync about what’s happening: who’s the leader, who owns which partition, which configuration is current, etc.
That shared understanding doesn’t happen automatically. It’s the result of consensus. Consensus gives a cluster a common reality, ensuring that all nodes make decisions in the same order and interpret the system’s state the same way. That ordering isn’t incidental; it’s what defines correctness at the system level. Without it, even the simplest cluster would fracture under ambiguity.
Every distributed system needs one coherent source of truth to anchor its coordination logic. Otherwise, leadership, membership, and configuration can all diverge, leaving each node acting on a different version of reality.
Consensus and Databases
As discussed in my earlier post, architectures like single-leader, multi-leader, and leaderless replication make peace with temporary inconsistencies through reconciliation. But that peace is only possible because consensus holds the control plane together:
• Single-leader systems rely on control-plane consensus (via ZooKeeper, etcd, or built-in election) to guarantee that only one node ever acts as leader, preventing split-brain even though replication itself is asynchronous.
• Multi-leader systems depend on the control plane to define replication topology and membership, allowing each region to accept writes independently while staying consistent in configuration.
• Leaderless systems such as Dynamo or Cassandra use control-plane consensus to fix ring membership and partition ownership, ensuring that even when replicas diverge on values, they still agree on who is responsible for reconciling them.
Without consensus, the control plane itself would diverge, leaving the data plane in chaos. This is why systems like ZooKeeper, etcd, and Consul exist. They don’t hold data, they hold decisions. And those decisions make distributed architectures such as single-leader, multi-leader, and leaderless possible in the first place.
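To make “they hold decisions” concrete, here is a minimal sketch of leader election on top of a consensus-backed store. The ConsensusKV class and its compare_and_set method are illustrative stand-ins for the atomic primitives that etcd transactions or ZooKeeper ephemeral nodes provide, not a real client API:

```python
# Minimal sketch: leader election on top of a consensus-backed KV store.
# `compare_and_set` is a hypothetical stand-in for the atomic
# compare-and-swap that etcd transactions or ZooKeeper znodes offer.

class ConsensusKV:
    """Toy in-memory stand-in for a linearizable store like etcd."""
    def __init__(self):
        self._data = {}

    def compare_and_set(self, key, expected, new):
        # Atomic in a real store, because consensus totally orders all writes.
        if self._data.get(key) == expected:
            self._data[key] = new
            return True
        return False

def try_become_leader(kv, node_id, key="cluster/leader"):
    # Only one contender can move the key from "no leader" to itself;
    # every node observes the same winner because writes are ordered.
    return kv.compare_and_set(key, expected=None, new=node_id)

kv = ConsensusKV()
print(try_become_leader(kv, "node-a"))  # True: node-a wins the election
print(try_become_leader(kv, "node-b"))  # False: leadership already taken
```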
Consensus and Microservices
Consensus isn’t only for databases. It’s also what makes large microservice systems stay coherent. Microservices depend on consensus in the control plane for coordination and safety, not for shared data.
They use it for:
• Service discovery: which instance of a service is alive and reachable (often tracked in systems like Consul or etcd).
• Leader election: deciding which service instance performs singleton work like batch jobs or cache invalidation.
• Configuration management: ensuring all services agree on the latest configuration or feature flag rollout.
• Distributed locks or leases: preventing multiple instances from performing the same operation (e.g., double-sending emails or re-processing messages).
Here, consensus ensures agreement on control decisions, not on application data. The data plane - each service’s API and database - remains independent, scaling freely without needing global ordering.
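As an illustration of the lease pattern from the last bullet above, here is a minimal single-process sketch; the LeaseLock class is a hypothetical stand-in for what a consensus store (etcd leases, Consul sessions) would back in production:

```python
import time

class LeaseLock:
    """Toy lease: a lock that expires unless renewed, so a crashed holder
    cannot block others forever. Real systems back this with a consensus
    store so every instance agrees on who holds it."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, instance_id):
        now = time.monotonic()
        # Grant the lease if it is free or the previous holder's TTL lapsed.
        if self.holder is None or now >= self.expires_at:
            self.holder = instance_id
            self.expires_at = now + self.ttl
            return True
        return False

lock = LeaseLock(ttl_seconds=30)
if lock.acquire("worker-1"):
    pass  # safe to send the email batch exactly once
print(lock.acquire("worker-2"))  # False until worker-1's lease lapses
```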
The Scope of Consensus in the Data Plane
Once we descend into the data plane, the nature of agreement changes. It’s no longer about coordination, but about the meaning of correctness itself. Does correctness depend only on the final values being right or also on the order of operations that produced them?
This is the boundary line that decides whether a system can rely on reconciliation or must pay the price of consensus.
| Type of correctness | Typical approach | Example systems |
|---|---|---|
| Value-based correctness (final state matters) | Reconciliation / eventual convergence | Cassandra, DynamoDB, multi-leader replication |
| Order-based correctness (sequence defines correctness) | Consensus or single-leader synchrony | Spanner, CockroachDB, Oracle, financial ledgers |
If your only goal is to make sure everyone eventually agrees on the final value, reconciliation is enough. But if your goal is to make sure everyone agrees on the order that produced that value, reconciliation can’t help you - you need consensus, or a trusted single leader.
When Final Values Are Enough
Not every workload needs total order. In many systems, correctness depends only on the final value, not how it was reached. That means replicas can diverge temporarily and reconcile later.
Take these examples:
1. Counting Page Views or Likes
If two users like the same post at the same time, the order of increments doesn’t change the final count: applying +1 and +1 in either order yields the same total.
Each replica can apply the increment independently. If updates collide, reconciliation just merges their deltas.
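A minimal sketch of that merge, modelled on a grow-only counter (a G-counter CRDT); the replica names are illustrative:

```python
# Each replica tracks only its own increments; merging takes the
# per-replica maximum, so merges commute and order is irrelevant.

def merge(counts_a, counts_b):
    replicas = set(counts_a) | set(counts_b)
    return {r: max(counts_a.get(r, 0), counts_b.get(r, 0)) for r in replicas}

replica_1 = {"r1": 1}          # saw one like locally
replica_2 = {"r2": 1}          # saw a concurrent like
merged = merge(replica_1, replica_2)
print(sum(merged.values()))    # 2, regardless of merge order
```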
2. Product Ratings and Aggregates
If two users submit ratings at the same time, it doesn’t matter which rating arrived first: the final average and vote count will converge once all updates are applied.
Replicas can even compute partial aggregates independently and later reconcile by merging their totals and vote counts. Because merging sums and counts is commutative and associative, the order of updates doesn’t affect the final average.
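A sketch of that reconciliation, merging hypothetical (sum, count) pairs and deriving the average afterwards:

```python
def merge_aggregates(a, b):
    # (sum, count) pairs merge by addition, which is commutative and
    # associative, so any merge order gives the same final average.
    return (a[0] + b[0], a[1] + b[1])

east = (9, 2)   # two ratings on one replica: 4 and 5
west = (3, 1)   # one concurrent rating elsewhere: 3
total, votes = merge_aggregates(east, west)
print(total / votes)  # 4.0 either way
```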
3. Shopping Carts
Each replica might receive different updates (add/remove) for the same cart. At merge time, you can apply deterministic rules:
• Set union for additions,
• Set difference for removals,
• or last-write-wins based on timestamp.
No global order is required, only a merge policy that deterministically combines updates.
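A sketch of such a merge policy, assuming each replica tracks additions and removals as sets (here removals win; an add-wins policy would be equally valid, the point is only that the rule is deterministic):

```python
def merge_cart(a_adds, a_removes, b_adds, b_removes):
    # Deterministic policy: union the additions, union the removals,
    # then let removals win; no global order of operations is needed.
    adds = a_adds | b_adds
    removes = a_removes | b_removes
    return adds - removes

replica_a = ({"book", "pen"}, set())       # items added on one replica
replica_b = ({"book"}, {"pen"})            # pen removed concurrently elsewhere
print(merge_cart(*replica_a, *replica_b))  # {'book'}
```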
4. Session or Activity Logs
Logs are append-only; new rows don’t overwrite previous ones. If two replicas receive inserts concurrently, they can both safely persist them. During reconciliation, duplicate detection or timestamp sorting produces a consistent timeline.
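A sketch of that reconciliation, assuming each log entry carries a unique ID and a timestamp:

```python
def reconcile_logs(*replica_logs):
    # Deduplicate by entry ID, then sort by timestamp to rebuild one
    # consistent timeline from independently written replicas.
    seen = {}
    for log in replica_logs:
        for entry in log:
            seen[entry["id"]] = entry
    return sorted(seen.values(), key=lambda e: e["ts"])

log_a = [{"id": "e1", "ts": 1, "event": "login"}]
log_b = [{"id": "e1", "ts": 1, "event": "login"},   # duplicate delivery
         {"id": "e2", "ts": 2, "event": "view"}]
print([e["event"] for e in reconcile_logs(log_a, log_b)])  # ['login', 'view']
```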
As these examples show, the system can afford to be “eventually right”, because the order of updates carries no semantic weight. Here, consensus would only slow things down for no real benefit.
When Order Defines Correctness
When workloads do depend on a correctness guarantee defined by a single, agreed history, and not just a matching final state, consensus is required in the data plane.
Take these examples:
1. Financial Transactions — Atomic Transfers
Consider a transaction that touches two different shards (each holding one account). The system must ensure:
• Both updates succeed or both abort (atomicity)
• All replicas agree on whether the transfer happened (consistency)
• The debit happens before the credit (ordering)
If one replica commits while another aborts, balances diverge permanently.
To prevent this, systems like Spanner and CockroachDB use Raft or Paxos to replicate each shard’s log and then coordinate the commit decision via consensus.
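The commit decision follows a two-phase pattern; below is a drastically simplified, single-process sketch of it (account names are illustrative, and real systems additionally replicate each shard’s log via Raft or Paxos so the decision survives crashes):

```python
class Shard:
    def __init__(self, balances):
        self.balances = balances
        self.staged = None

    def prepare(self, account, delta):
        # Vote yes only if the change can be applied safely.
        if self.balances[account] + delta < 0:
            return False
        self.staged = (account, delta)
        return True

    def commit(self):
        account, delta = self.staged
        self.balances[account] += delta

    def abort(self):
        self.staged = None

def transfer(src, dst, amount):
    # Atomic commit: both shards apply the transfer, or neither does.
    if src.prepare("alice", -amount) and dst.prepare("bob", +amount):
        src.commit()
        dst.commit()
        return True
    src.abort()
    dst.abort()
    return False

s1, s2 = Shard({"alice": 100}), Shard({"bob": 0})
print(transfer(s1, s2, 150))  # False: alice lacks funds, nothing applied
print(transfer(s1, s2, 100))  # True: both updates land together
```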
2. Global Uniqueness — Preventing Duplicate Inserts
Uniqueness is a global invariant. There must never be two rows with the same key. In a sharded system, if different nodes independently accept inserts and reconcile later, duplicates can sneak in.
Consensus ensures that all replicas agree on which insert “won” before it’s visible.
3. Ordering-Sensitive Operations — Ledgers, Logs, and Histories
Order defines meaning! The same operations applied in a different order can yield different outcomes. For example, starting from a $50 balance with no overdraft allowed, “withdraw $100 then deposit $50” rejects the withdrawal, while “deposit $50 then withdraw $100” accepts it.
If each replica accepts writes independently, their ledgers may diverge in sequence, even if they eventually converge in value. A Raft or Paxos log ensures that all replicas append entries in the same order.
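A sketch that makes this divergence concrete, applying the same two operations in both orders under a no-overdraft rule:

```python
def apply(balance, op):
    kind, amount = op
    if kind == "withdraw":
        # No-overdraft rule: reject withdrawals that would go negative.
        return balance - amount if balance >= amount else balance
    return balance + amount

ops = [("withdraw", 100), ("deposit", 50)]
start = 50
order_1 = apply(apply(start, ops[0]), ops[1])  # withdrawal rejected -> 100
order_2 = apply(apply(start, ops[1]), ops[0])  # withdrawal succeeds -> 0
print(order_1, order_2)  # 100 0: same operations, different histories
```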
4. Serialisable Multi-Statement Transactions
To guarantee serialisable isolation, the system must ensure that between a transaction’s read (the SELECT of a balance) and its write (the UPDATE of that balance), no concurrent transaction changed the same balance.
In a distributed setting, this requires coordination among replicas to decide the visible state, effectively a per-shard consensus on read timestamps and commit order.
Consensus is mandatory when transactions span both reads and writes whose correctness depends on serial order.
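Here is a single-process sketch of the lost-update race that serialisable isolation must prevent: two transactions read the same balance, both write, and one update silently disappears:

```python
balance = {"acct": 100}

def read_then_update(snapshot, delta):
    # Read-modify-write computed from a possibly stale snapshot.
    return snapshot + delta

# Both transactions SELECT the balance before either UPDATEs it.
t1_read = balance["acct"]
t2_read = balance["acct"]
balance["acct"] = read_then_update(t1_read, -30)  # T1 writes 70
balance["acct"] = read_then_update(t2_read, -50)  # T2 writes 50, losing T1
print(balance["acct"])  # 50, not the serial result of 20
```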
5. Monotonic Global Identifiers
Suppose IDs are issued from a counter such as global_seq. If every node increments it independently, IDs collide or regress.
Consensus ensures that replicas agree on the “next” value and commit that decision atomically before issuing it.
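A sketch contrasting independent counters with one agreed sequence; the shared counter below stands in for a value that a consensus group would commit before issuing:

```python
import itertools

# Without agreement: each node keeps its own counter, so IDs collide.
node_a = itertools.count(1)
node_b = itertools.count(1)
print(next(node_a), next(node_b))  # 1 1 -> duplicate IDs

# With agreement: all nodes draw from one totally ordered sequence
# (in a real system, a consensus group commits each "next value").
global_seq = itertools.count(1)
print(next(global_seq), next(global_seq))  # 1 2 -> unique and monotonic
```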
6. Referential Integrity Across Shards
To enforce a FOREIGN KEY (user_id) constraint on, say, an orders table, the system must know whether user 42 truly exists now, not eventually.
Without consensus, one shard could delete the user while another inserts the order. With consensus, the system serialises referential checks across shards, guaranteeing a consistent snapshot.
The Middle Ground: Single-Leader Synchrony
Consensus isn’t the only way to preserve order. Single-leader databases like Oracle, PostgreSQL, and MySQL achieve the same correctness without Raft or Paxos, not by negotiating agreement, but by controlling the environment in which agreement holds naturally.
This control comes from three key guarantees working together:
1. Synchronous replication - Every commit waits for at least one standby to confirm it before the leader acknowledges success. No transaction is considered complete until it’s safely persisted on another node.
2. Controlled failover - External orchestration tools such as Oracle Data Guard Broker or Patroni with etcd ensure that only a fully caught-up replica can ever take over. A lagging standby is never promoted, eliminating the risk of data loss (also discussed here).
3. Single-region reliability - Because these replicas typically operate within the same low-latency network, commit acknowledgements and ordering decisions happen almost instantly, keeping the system effectively in lockstep.
Together, these constraints make consensus unnecessary, but only within the boundaries of a tightly controlled, synchronous environment. The environment itself guarantees the order by engineering away the uncertainty that consensus would otherwise have to enforce. Step outside those boundaries, across datacentres, across shards, across unpredictable networks, and this safety net unravels. The moment synchrony can’t be trusted, consensus becomes indispensable.
That’s why large-scale, geo-distributed databases like Spanner, CockroachDB, or YugabyteDB embed consensus directly into the data plane: they can’t rely on the environment to stay ordered, so they formalise order instead.
The real takeaway here isn’t that one approach is superior. It’s that each secures order differently. When order defines correctness, a system must still ensure one globally accepted history either by protocol, or by environment.