
High-Watermark in Distributed Systems: A Deep Dive with Apache Kafka


Let me start with a simple question. When a producer sends a message to Kafka, the leader broker writes it to its local disk and sends an acknowledgment back. Is the message now durable and available for consumption?

Not quite.

What if the leader crashes before the in-sync replicas (ISRs) have copied the message from it? If a consumer has already consumed the message too eagerly, the system can no longer recover it. From the consumer's point of view, the message existed; from the system's point of view, after recovery, it never did.

Let's walk through the sequence of steps in a three-broker (Broker-1, Broker-2, Broker-3) Kafka cluster with producer acks=1, min.insync.replicas=2, and replication.factor=3:

1. Leader (Broker-1) writes message M, sends ack to producer.
2. A consumer too eagerly consumes message M.
3. Broker-1 crashes, M exists ONLY on Broker-1's disk.
4. Kafka elects a new leader from the ISR, say Broker-3.
5. Broker-3 doesn't have the message M. Cluster is now running without M.
6. Eventually, Broker-1 comes back online.
   But it comes back as a FOLLOWER, not a leader.
   It must sync itself to the new leader (Broker-3).
   Since Broker-3 doesn't have M, Broker-1 TRUNCATES its own log
   and deletes M to match.

→ M is now gone from every broker in the cluster. Permanently.

This is exactly the class of problem that the High-Watermark (HWM) solves in Apache Kafka. Once you understand the concept, you will recognize it, at least conceptually, in other systems: Raft's commit index, ZooKeeper, PostgreSQL replication.

Definitions First

Three terms are essential to understanding HWM in Kafka.

Log End Offset (LEO) is simply the offset of the next message to be written. Think of it as the tip of the log. If a broker has messages at offsets 0, 1, 2, 3, 4, its LEO is 5.

High-Watermark (HWM) is the offset of the first message that has not yet been replicated to every broker in the current in-sync replica set; equivalently, one past the last fully replicated offset. Consumers can only read up to (but not including) this offset, never beyond it.

In-Sync Replica (ISR) is the set of brokers (leader + followers) that are caught up to the leader. A follower falls out of the ISR if it stops fetching messages fast enough, controlled by the config replica.lag.time.max.ms (default: 30 seconds).
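That eviction rule can be sketched in a couple of lines (a toy check, not Kafka's actual bookkeeping; the function name and timestamps are made up for illustration):

```python
# Toy model of ISR membership: a follower stays in the ISR only if it has
# fully caught up to the leader within replica.lag.time.max.ms.

REPLICA_LAG_TIME_MAX_MS = 30_000  # Kafka's default

def in_isr(now_ms, last_caught_up_ms):
    """True if the follower last caught up to the leader's LEO recently enough."""
    return (now_ms - last_caught_up_ms) <= REPLICA_LAG_TIME_MAX_MS

print(in_isr(now_ms=60_000, last_caught_up_ms=45_000))  # True  (15 s behind)
print(in_isr(now_ms=60_000, last_caught_up_ms=20_000))  # False (40 s behind: evicted)
```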

The most important relationship to remember: HWM ≤ LEO, always. The leader's LEO moves ahead with every write. The HWM moves ahead only when all ISR members confirm they have replicated the latest data.
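The invariant can be sketched in a few lines of Python (a toy model, not Kafka's implementation; broker names and log contents follow the example below):

```python
# Toy model: the HWM is the minimum LEO across the current ISR,
# so HWM <= leader's LEO holds by construction.

def leo(log):
    """Log End Offset: the offset of the NEXT message to be written."""
    return len(log)

def high_watermark(isr_logs):
    """HWM: the smallest LEO among all ISR members."""
    return min(leo(log) for log in isr_logs.values())

isr_logs = {
    "Broker-1": ["A", "B", "C", "D", "E", "F"],  # leader, LEO=6
    "Broker-2": ["A", "B", "C", "D", "E", "F"],  # caught up, LEO=6
    "Broker-3": ["A", "B", "C", "D", "E"],       # lagging,   LEO=5
}

hwm = high_watermark(isr_logs)
print(hwm)                               # 5: consumers may read offsets 0..4 only
assert hwm <= leo(isr_logs["Broker-1"])  # HWM <= LEO, always
```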

A Concrete Walk-Through

Let's use a simple setup through this post:

  • Replication factor=3: one leader (Broker-1) and two followers (Broker-2, Broker-3).

  • ISR={Broker-1, Broker-2, Broker-3}, all three are in sync.

  • acks=1: the producer gets an acknowledgment as soon as the leader writes locally.

  1. The Starting State

Everything is caught up. All three brokers have messages A through E.

Offset:          0    1    2    3    4
                [A]  [B]  [C]  [D]  [E]

Broker-1  (L):  [A]  [B]  [C]  [D]  [E]   LEO=5, HWM=5
Broker-2  (F):  [A]  [B]  [C]  [D]  [E]   LEO=5
Broker-3  (F):  [A]  [B]  [C]  [D]  [E]   LEO=5

Consumer can read: A, B, C, D, E (offsets 0 through 4)

HWM = 5 means "everything up to offset 4 is safe." Consumers are happy.

  2. A New Message Arrives: Producer Writes F

The producer sends message F. With acks=1, the leader writes it locally and immediately sends back an acknowledgment to the producer. Job done from the producer's perspective.

Offset:          0    1    2    3    4    5
                [A]  [B]  [C]  [D]  [E]  [F]

Broker-1  (L):  [A]  [B]  [C]  [D]  [E]  [F]     LEO=6, HWM=5 ← still 5!
Broker-2  (F):  [A]  [B]  [C]  [D]  [E]          LEO=5
Broker-3  (F):  [A]  [B]  [C]  [D]  [E]          LEO=5

Producer: Got acknowledgment. Thinks F is written.
Consumer can read: A, B, C, D, E but F is NOT visible yet.

Notice something important here: the producer got its ack, but the HWM is still 5. Consumers still can't see F.

This is the key insight: acks and HWM are completely independent mechanisms. acks controls when the producer hears back. HWM controls when consumers are allowed to read. The leader knows it can't move the HWM forward yet because Broker-2 and Broker-3 haven't fetched F.

  3. Replication Happens — But Not Evenly

Followers in Kafka continuously send fetch requests to the leader, pulling new messages. Let's say Broker-2 is quick and fetches F right away. Broker-3 is slightly behind, maybe because of a brief GC pause or a slower network.

Offset:          0    1    2    3    4    5
                [A]  [B]  [C]  [D]  [E]  [F]

Broker-1  (L):  [A]  [B]  [C]  [D]  [E]  [F]   LEO=6, HWM=5
Broker-2  (F):  [A]  [B]  [C]  [D]  [E]  [F]   LEO=6
Broker-3  (F):  [A]  [B]  [C]  [D]  [E]        LEO=5
                                          ↑
                              HWM stays at 5.
Broker-3 is still in ISR and hasn't fetched F yet.

Consumer can read: A through E. F is still blocked.

How does the leader know about the last message offset in Broker-3? When every follower sends a fetch request to the leader, it carries the offset the follower wants next. When Broker-3 sends FetchRequest(offset=5), the leader knows it has everything up to offset 4. When Broker-2 sends FetchRequest(offset=6), the leader knows Broker-2 has message F. The leader tracks the minimum across all ISR members, and that's the HWM.
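That bookkeeping can be sketched as follows (a toy model, not Kafka's internals; the class name is invented, but the offsets match the walk-through above):

```python
# Toy model: the leader learns each follower's progress from the offset
# carried in its fetch request, then sets the HWM to the minimum known LEO.

class ToyLeader:
    def __init__(self, log, followers):
        self.log = log
        # The leader's own LEO plus the last known LEO of each follower.
        self.known_leo = {f: 0 for f in followers}
        self.known_leo["leader"] = len(log)
        self.hwm = 0

    def handle_fetch(self, follower, fetch_offset):
        """FetchRequest(offset=n) proves the follower has offsets 0..n-1."""
        self.known_leo[follower] = fetch_offset
        self.hwm = min(self.known_leo.values())
        return self.log[fetch_offset:]  # messages the follower still needs

leader = ToyLeader(["A", "B", "C", "D", "E", "F"], ["Broker-2", "Broker-3"])
leader.handle_fetch("Broker-2", 6)   # Broker-2 already has F
leader.handle_fetch("Broker-3", 5)   # Broker-3 still needs F
print(leader.hwm)                    # 5: F is not yet visible to consumers

leader.handle_fetch("Broker-3", 6)   # Broker-3 caught up
print(leader.hwm)                    # 6: F is now committed
```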

  4. Broker-3 Catches Up → HWM Finally Advances

Broker-3 fetches F. Now all three ISR members have it. The leader advances the HWM to 6.

Offset:          0    1    2    3    4    5
                [A]  [B]  [C]  [D]  [E]  [F]

Broker-1  (L):  [A]  [B]  [C]  [D]  [E]  [F]   LEO=6, HWM=6
Broker-2  (F):  [A]  [B]  [C]  [D]  [E]  [F]   LEO=6
Broker-3  (F):  [A]  [B]  [C]  [D]  [E]  [F]   LEO=6

Consumer can now read: A, B, C, D, E, F

F is now a committed message. Every ISR member has it. Even if the leader crashes right now, any follower that becomes the new leader will have F, and no data is lost.

So What Happens When the Leader Crashes Mid-Replication?

Let's go back to the state where Broker-3 hadn't yet fetched F:

Broker-1  (L):  [A]  [B]  [C]  [D]  [E]  [F]   ← crashes right here
Broker-2  (F):  [A]  [B]  [C]  [D]  [E]  [F]
Broker-3  (F):  [A]  [B]  [C]  [D]  [E]
                                         HWM = 5

Kafka now needs to elect a new leader from the ISR. Either Broker-2 or Broker-3 can win. Let's say Broker-3 wins the election.

Broker-3 (New L): [A]  [B]  [C]  [D]  [E]   ← only has up to E. F is gone.
Broker-2  (F):    [A]  [B]  [C]  [D]  [E]  [F]  ← has F, but must truncate to match the new leader

New HWM = 5. F is permanently lost.

Broker-2 will truncate its own log to match the new leader, dropping F. The message is gone.

But here's the important part: no consumer ever read F. It was above the HWM the entire time. From the consumer's perspective, nothing unusual happened. The system is consistent.

This is the trade-off you accept with acks=1. The producer got an ack, but the system made no durability promise. If you want F to survive any single broker failure, you need acks=all.

Now, let's say Broker-2 wins the election instead.

Broker-2 (New L): [A]  [B]  [C]  [D]  [E]  [F] ← has F, becomes leader
Broker-3  (F):    [A]  [B]  [C]  [D]  [E]      ← catches up and fetches F

HWM advances to 6 once Broker-3 fetches F. F survives.

Whether F survives depends on which ISR member wins the election. This non-determinism is exactly why acks=1 provides weaker durability. The producer receives an acknowledgment before replication, so a leader crash can still cause data loss.
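A small simulation makes the non-determinism concrete (illustrative only; the `failover` helper is a simplification — Kafka's actual truncation protocol has used leader epochs rather than raw log comparison since KIP-101):

```python
# Toy simulation: whether F survives the crash depends solely on
# which ISR member wins the leader election.

def failover(logs, crashed, new_leader):
    """Remove the crashed broker, then truncate every survivor's log
    to match the new leader (the new leader's log is the truth)."""
    survivors = {b: log[:] for b, log in logs.items() if b != crashed}
    leader_len = len(survivors[new_leader])
    return {b: log[:leader_len] for b, log in survivors.items()}

logs = {
    "Broker-1": ["A", "B", "C", "D", "E", "F"],  # leader, crashes with F unreplicated
    "Broker-2": ["A", "B", "C", "D", "E", "F"],
    "Broker-3": ["A", "B", "C", "D", "E"],
}

# Broker-3 wins: Broker-2 truncates, F is gone from the whole cluster.
after = failover(logs, crashed="Broker-1", new_leader="Broker-3")
print(any("F" in log for log in after.values()))   # False

# Broker-2 wins: F survives, and Broker-3 fetches it later.
after = failover(logs, crashed="Broker-1", new_leader="Broker-2")
print(any("F" in log for log in after.values()))   # True
```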

The acks Settings Compared

Let's make this concrete with all three options:

acks=0: Producer sends and moves on. No acknowledgment at all. Fastest, but any broker issue can lead to loss of data. Good for metrics or logs where occasional loss is acceptable.

acks=1: Leader writes locally and acknowledges. Followers replicate in the background. If the leader crashes before replication, the message is lost. HWM still protects consumers from reading uncommitted data.

acks=all (or -1) with min.insync.replicas=2: Leader waits until at least 2 ISR members have written the message before acknowledging. F would only get an ack after the leader and at least one follower confirmed it. Slower, but the message is guaranteed to exist on at least two brokers, so it survives any single broker failure.

The combination you want for anything important is acks=all + min.insync.replicas=2 + replication.factor=3 + unclean.leader.election.enable=false. These together provide strong durability guarantees and are the common production configuration for surviving a single broker failure.
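Expressed as configuration, that combination looks roughly like this (a sketch; note that the settings live in different places: acks is a producer client setting, min.insync.replicas and unclean.leader.election.enable are broker or topic settings, and the replication factor is chosen at topic creation):

```properties
# Producer (client) configuration
acks=all

# Broker / topic configuration
min.insync.replicas=2
unclean.leader.election.enable=false
# Replication factor: pass --replication-factor 3 when creating the topic
```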

Remember, Kafka's durability guarantee is based on HWM, not producer ACK. This means a producer can receive an acknowledgment for a message that is still above the HWM (LEO > HWM) and therefore not yet fully committed.

Summary

The High-Watermark is Kafka's promise to consumers: "Everything you read has been replicated to every broker in the ISR — you will never read a message the cluster can't recover."

The LEO always moves ahead. The HWM follows carefully behind. And the gap between them — that narrow window of messages written but not yet fully replicated — is the territory Kafka keeps hidden from consumers until it's safe.

Understanding this precisely lets you reason about acks settings, ISR configuration, and replication lag beyond what the config documentation tells you. It also means that, in a production Kafka cluster, you know where to look when the producer gets an acknowledgment from the leader but the consumers don't see the message.