The Illusion of Time in Distributed Systems, Part I: The Subtle Lie That Ticks Beneath Our Servers

Distributed systems often behave as if time were a universal, objective truth, as if two machines agreeing on a timestamp meant they also agreed on the order of events. But clocks in computers are not laws of nature! They drift over time, freeze under load, stumble over leap seconds, and lag behind through network delays.

This post, the first in a series, builds on my understanding of time’s unreliability, as explored in Designing Data-Intensive Applications, to examine how fragile clocks break our reasoning about order and why every serious distributed system eventually stops trusting physical time altogether.

But before we can talk about unreliable clocks, it’s worth asking just how much we rely on them in the first place. The short answer: almost everywhere!

Where We Depend on Time

Distributed systems routinely use timestamps for:
Ordering events - deciding which transaction happened first.
Replication - “last write wins” conflict resolution.
Coordination - sessions, leases, caches, and locks.
Performance measurement - latency, timeout detection, and SLA enforcement.
Monitoring and tracing - aligning logs across machines.

Each of these depends on the assumption that clocks are trustworthy, but that assumption is dangerously false. And not all uses of time are equal. There’s a crucial difference between using time as a stopwatch and using time as truth:
Performance measurement relies on monotonic clocks, which only move forward. They’re ideal for measuring durations (like “how long this request took”) and don’t care what the real time of day is.
Ordering, replication, coordination, and monitoring and tracing, on the other hand, rely on time-of-day clocks. These clocks try to stay in sync across machines. When two servers disagree on “what time it is,” they can disagree on which event happened first, and that’s when things start to break!

The Many Ways Time Lies

Clock Drift and Skew

Every computer’s internal clock relies on a quartz crystal oscillator, a tiny hardware component that vibrates at a constant frequency to keep time. In theory, this vibration defines a precise second. In practice, every crystal ticks at a slightly different rate, and even small temperature changes alter its frequency.

A tiny error, say one part per million, adds up fast, leading to nearly 86 milliseconds of drift per day. Over hours or days, those small differences accumulate across machines into what we call clock skew, the gap between one node’s “now” and another’s.
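The arithmetic behind that figure is worth making explicit; a quick sketch (the drift rates here are illustrative, not measurements of any particular crystal):

```python
# How much a clock drifts per day at a given error rate, in parts per million (ppm).
def drift_per_day_ms(ppm: float) -> float:
    seconds_per_day = 86_400
    return ppm * 1e-6 * seconds_per_day * 1_000  # milliseconds of drift per day

drift_per_day_ms(1)   # ~86.4 ms/day at 1 ppm
drift_per_day_ms(50)  # ~4,320 ms/day: a few seconds of skew in a single day
```

A rate that sounds negligible per second quietly becomes milliseconds per day, and two nodes drifting in opposite directions accumulate skew at twice that rate.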

Once that happens, timestamps stop being trustworthy. Two honest servers can record the same event with conflicting times. And in systems that rely on “last write wins” conflict resolution, that subtle skew can silently overwrite correct data with stale versions.

This drift and skew affect time-of-day clocks, the ones trying to stay in sync across machines. Monotonic clocks, used for measuring durations or timeouts, don’t need synchronisation. They can run slightly fast or slow, but since they’re never compared across nodes, that difference doesn’t cause skew.

The Fragile Fix: NTP

To fight this natural drift, systems turn to the Network Time Protocol (NTP). It’s the invisible background service that attempts to keep clocks “roughly” in sync. But NTP is not magic. It only mitigates drift; it doesn’t eliminate it.

When a client detects that its local clock is off, it can correct it in one of two ways:
Steps the clock - an instant jump forward or backward, or
Slews the rate - speeds the clock up or slows it down until it’s back in sync.

If it steps backward, timestamps literally move into the past. This is dangerous for applications depending on monotonic time progression (e.g., timeouts, caches, or logs).

NTP itself depends on timely and accurate network communication. When networks slow down or drop packets, synchronisation falters; time accuracy can drift for hours or even days before anyone notices.

To make things worse, public NTP servers aren’t always reliable. Some are misconfigured or report wildly incorrect times, occasionally off by hours. To defend against this, NTP clients must poll multiple servers and discard outliers, but that still means placing trust in external, sometimes unverified, sources.

Because NTP only adjusts the time-of-day clock, the monotonic clock continues advancing smoothly, unaffected by these corrections. It remains steady even when wall time jumps backward.
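Most platforms expose the two clocks separately. In Python, for instance, `time.time()` reads the time-of-day clock and `time.monotonic()` reads the monotonic one; a minimal sketch of using the right clock for durations:

```python
import time

# Time-of-day clock: can jump backward if NTP steps the clock.
start_wall = time.time()
# Monotonic clock: guaranteed by the OS never to run backward; use it for durations.
start_mono = time.monotonic()

time.sleep(0.1)  # stand-in for the work being timed

elapsed = time.monotonic() - start_mono  # safe: always >= 0
wall_elapsed = time.time() - start_wall  # could be negative after a backward step
```

Timing a request with `time.time()` usually gives the same answer, right up until an NTP step lands mid-request and the “duration” comes out negative.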

When clocks drift beyond acceptable bounds, distributed systems must take drastic measures: any node whose clock diverges too far from the rest is evicted or isolated from the cluster until its time catches up, sacrificing availability for consistency.

When Time Jumps: Process Pauses

Even if a clock doesn’t drift, it can still freeze.

Modern computers are full of moving parts: schedulers that pre-empt threads, garbage collectors that halt execution, and hypervisors that suspend entire virtual machines. These pauses come in two flavours:
In-process pauses (OS pre-emption, stop-the-world GC): When the operating system temporarily pre-empts your thread, or a stop-the-world garbage collector suspends all threads, your program simply stops executing. The host’s monotonic clock keeps ticking, but your code isn’t running to observe or act on it. When the process resumes, it discovers that more time has passed than it ever experienced.
VM suspends (live migration, snapshot, contention): Virtual machines share physical CPU cores, and the hypervisor decides when each guest runs. If a VM is paused, e.g. for live migration, snapshotting, or CPU contention, its virtualised clock stops entirely. When it resumes, the guest OS believes no time has passed while the rest of the cluster has moved on.

At first glance, these might seem harmless: the clock never ran backward, after all (it merely froze in the VM case). But distributed systems often depend on timely actions, not just ordered ones. Heartbeats go unsent, leases go unrenewed, and timers go unchecked, which can quickly cascade into split-brain states, missed renewals, or data corruption.

A clock may never move backward, yet it can still stall, freeze, or leap ahead relative to other nodes. In a distributed system, one node’s harmless pause can look exactly like another node’s failure.
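A toy simulation makes the symmetry concrete (the timeout value and the `sleep` standing in for a stop-the-world pause are both hypothetical):

```python
import time

# Seconds of silence before the cluster declares a node dead.
HEARTBEAT_TIMEOUT = 0.2

last_heartbeat = time.monotonic()

# The node is perfectly healthy, but a GC pause / VM suspend stops it
# from sending heartbeats. A sleep stands in for that pause here.
time.sleep(0.3)

silence = time.monotonic() - last_heartbeat
node_presumed_dead = silence > HEARTBEAT_TIMEOUT  # True: pause is indistinguishable from crash
```

From the observer’s side there is no way to tell whether the node crashed, the network dropped its heartbeats, or it simply paused; all three look identical.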

When Time Bends: Leap Seconds

Even if clocks are perfectly synchronised, the very definition of a “second” isn’t fixed to our planet. Atomic clocks measure time with astonishing precision, while the Earth, against which we measure time, rotates a little unevenly. To keep Coordinated Universal Time (UTC) aligned with the planet’s actual rotation, as represented by UT1, the International Earth Rotation and Reference Systems Service occasionally adds or removes a second. This extra tick is known as a leap second.

At midnight on such a day, instead of jumping from 23:59:59 → 00:00:00, the clock briefly shows 23:59:60. It’s a small, one-second adjustment but it throws many systems into confusion.

Different operating systems and vendors handle it differently:
• Some ignore it entirely, skipping straight to the next minute and silently losing a second.
• Others smear it, spreading the extra second gradually over a few minutes or hours so time appears to flow smoothly.
• A few step it abruptly, literally freezing at 23:59:59 for one second before resuming.

Each approach avoids one problem but creates another: small discontinuities in recorded time. Logs from different machines may disagree about the same second; metrics can briefly appear negative; time-based ordering can break in subtle ways. In systems that depend on precise temporal ordering, such as databases, replication streams, and event logs, even one inconsistent second can cause replays, duplicates, or confusion about causality.
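Smearing can be modelled as a linear adjustment spread over a fixed window; a simplified sketch (a 24-hour linear smear, not any vendor’s exact scheme):

```python
# Spread one extra second evenly over a 24-hour window ending at the leap second.
WINDOW = 86_400  # smear window length, in seconds

def smear_offset(t: float) -> float:
    """Fraction of the leap second absorbed after t seconds of the window."""
    if t <= 0:
        return 0.0
    if t >= WINDOW:
        return 1.0
    return t / WINDOW

# Each smeared second is stretched by 1/86400 (~11.6 microseconds), so by the
# end of the window the clock has quietly absorbed the full extra second.
```

The cost is that, during the window, every smeared clock disagrees slightly with unsmeared UTC, and with any neighbour smearing on a different schedule.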

The End of the Leap Second
In 2022, the General Conference on Weights and Measures voted to abolish leap seconds by 2035. UTC will no longer be forced to stay within 0.9 seconds of Earth’s rotation. It will be allowed to simply drift away, a few seconds per century. The decision formalises what many systems already do in practice: smear time gently rather than jumping or freezing it. For computers, this means smoother, uninterrupted time; for humanity, it marks a quiet acceptance that our clocks no longer follow the planet.

When the Illusion Breaks

Distributed systems rely on time for ordering, replication, coordination, performance measurement, and monitoring. When clocks drift, freeze, or disagree, each of these foundations cracks in peculiar ways:

Ordering Events: Impossible Orderings

Ordering by timestamp assumes every machine shares the same view of “now.”
When that assumption fails, event sequences become impossible.

Machine     Local Time      Event
Server A    10:00:00.050    Receives payment
Server B    10:00:00.020    Issues refund

To a human, the refund clearly came after the payment. To the system, timestamp comparison says otherwise, and chaos follows. This is how global ordering collapses into contradiction: correct locally, incoherent globally.
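Sorting those two events by their local timestamps reproduces the contradiction directly (values taken from the table above):

```python
# Each event carries the timestamp its own server assigned
# (milliseconds past 10:00:00).
events = [
    ("payment", 50),  # Server A, 10:00:00.050 - happened first in reality
    ("refund", 20),   # Server B, 10:00:00.020 - happened second, on a lagging clock
]

by_timestamp = sorted(events, key=lambda e: e[1])
order = [name for name, _ in by_timestamp]  # ['refund', 'payment'] - causality inverted
```

No individual machine did anything wrong; the contradiction only appears when their timestamps are compared as if they came from one clock.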

Replication: Lost Updates

Systems like Cassandra use last-write-wins to resolve conflicts. Each replica timestamps its write, and the latest one wins. But if their clocks differ, an older write can appear newer and silently overwrite a legitimate update. The cluster remains “consistent,” but only according to a distorted timeline.
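The failure mode is easy to reproduce with a toy merge function (a minimal sketch of the last-write-wins rule, not Cassandra’s actual code):

```python
# Last-write-wins: keep whichever write carries the larger timestamp.
def lww_merge(a: tuple, b: tuple) -> tuple:
    """Each write is (timestamp, value); the 'newer' timestamp wins."""
    return a if a[0] >= b[0] else b

# Real order: v1 was written first, then v2 overwrote it.
# But the second writer's clock is 10 units behind the first writer's.
v1 = (1000, "old balance")  # fast clock
v2 = (990, "new balance")   # slow clock, but causally later

winner = lww_merge(v1, v2)  # (1000, 'old balance') - the legitimate update is lost
```

Every replica applies the same deterministic rule and converges, which is exactly why the loss is silent: nothing looks like an error.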

Coordination: Expired Leases and Phantom Failures

Leases, locks, and heartbeats all depend on durations measured correctly.
Expired leases: A lease holder’s clock runs slow: only eight seconds have ticked locally when ten real seconds have elapsed, so it sees no need to renew yet. Another node, whose clock is accurate, sees the lease as expired and takes over. Now two owners hold the same lock.
Phantom failures: A paused process misses heartbeats or timeouts while others keep running. From the cluster’s view, it’s dead; from its own, it just blinked.

Coordination logic assumes progress marches in step across nodes. When time flows unevenly, cooperation turns into conflict.
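The double-ownership scenario can be sketched in a few lines (the 10-second lease and the 20% clock error are hypothetical numbers):

```python
# A 10-second lease, judged by each node's own clock.
LEASE = 10.0

def still_holds_lease(local_elapsed: float) -> bool:
    """A node considers the lease valid until its own clock says it expired."""
    return local_elapsed < LEASE

# The holder's clock runs 20% slow: 10 real seconds register as 8 local seconds.
real_elapsed = 10.0
holder_view = real_elapsed * 0.8  # 8.0 - holder still believes it owns the lock
other_view = real_elapsed         # 10.0 - everyone else sees the lease as expired

holder_thinks_valid = still_holds_lease(holder_view)      # True
others_think_expired = not still_holds_lease(other_view)  # True: two "owners" at once
```

Both nodes are following the protocol correctly; the conflict comes entirely from the two clocks measuring the same interval differently.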

Performance Measurement: False Latency and Delayed Timeouts

Performance metrics rely on measuring durations: how long did this take? If a thread stalls during a GC pause or a VM freeze, the monotonic clock keeps ticking while the process stops. When it resumes, latency appears inflated and timeouts may fire early or late. Even with perfect monotonicity, perception of time becomes distorted by pauses.
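The inflation is mechanical: the duration measurement cannot distinguish work from stall. A small sketch (the sleeps stand in for real work and a real pause):

```python
import time

start = time.monotonic()

time.sleep(0.01)  # the request handler's actual work: fast
time.sleep(0.25)  # a stop-the-world pause lands mid-request (simulated)

latency = time.monotonic() - start
# The reported latency charges the pause to the request, making a ~10 ms
# handler look like a ~260 ms one.
```

Percentile dashboards then show a latency spike even though no code path got slower, which is why tail latencies so often track GC and scheduler behaviour rather than application logic.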

Monitoring and Tracing: False Spikes and Missing Data

Monitoring systems rely on timestamps to align logs and traces across machines. When clocks skew, these timelines desynchronise. One machine may report a spike before it even receives the request that caused it; another may appear idle because its clock runs ahead. Traces fracture; causal links blur.

Conclusion

Time in distributed systems is not absolute. It’s a fragile consensus built atop drifting, pausing, untrustworthy clocks. You can try to correct it, smear it, or align it across servers, but you can’t make it universal.

The deeper truth is this: distributed systems don’t need a shared clock; they need a shared understanding of order. That understanding begins with logical clocks, which we’ll explore next in The Illusion of Time, Part II: How Systems Keep Order Without Real Time.


References

  1. Database Internals
  2. Designing Data-Intensive Applications
  3. How a quartz watch works
  4. Do not adjust your clock: scientists call time on the leap second