<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[The ByteFreak Blog]]></title><description><![CDATA[Deep dives into distributed systems, backend engineering, system design, scalability patterns, and production hardening.]]></description><link>https://bytefreak.dev</link><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 22:29:49 GMT</lastBuildDate><atom:link href="https://bytefreak.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[High-Watermark in Distributed Systems: A Deep Dive with Apache Kafka]]></title><description><![CDATA[Let me start with a simple question. When a producer sends a message to Kafka, the leader broker writes it down to its local disk and sends the acknowledgment back. Is this considered done and availab]]></description><link>https://bytefreak.dev/high-watermark-in-apache-kafka</link><guid isPermaLink="true">https://bytefreak.dev/high-watermark-in-apache-kafka</guid><category><![CDATA[Apache Kafka]]></category><category><![CDATA[distributed systems]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Subhashish (Subh) Bhattacharjee]]></dc:creator><pubDate>Wed, 04 Mar 2026 05:32:21 GMT</pubDate><content:encoded><![CDATA[<p>Let me start with a simple question. When a producer sends a message to Kafka, the leader broker writes it down to its local disk and sends the acknowledgment back. Is this considered done and available for consumption?</p>
<p>Not quite.</p>
<p>What if the leader crashes before the in-sync replicas (ISRs) have copied the message from the leader broker? If a consumer has too eagerly consumed the message, the system can no longer recover it. From the consumer's point of view, the message existed; from the system's point of view, after recovery, it never did.</p>
<p>Let's walk through the sequence of steps in a 3-broker (Broker-1, Broker-2, Broker-3) Kafka cluster with producer <code>acks=1, ISR=2, replication-factor=3</code>.</p>
<pre><code class="language-plaintext">1. Leader (Broker-1) writes message M, sends ack to producer.
2. A consumer too eagerly consumes message M.
3. Broker-1 crashes, M exists ONLY on Broker-1's disk.
4. Kafka elects a new leader from the ISR, say Broker-3.
5. Broker-3 doesn't have the message M. Cluster is now running without M.
6. Eventually, Broker-1 comes back online.
   But it comes back as a FOLLOWER, not a leader.
   It must sync itself to the new leader (Broker-3).
   Since Broker-3 doesn't have M, Broker-1 TRUNCATES its own log
   and deletes M to match.

→ M is now gone from every broker in the cluster. Permanently.
</code></pre>
<p>This is exactly the class of problem that the High-Watermark (HWM) solves in Apache Kafka. And once you understand the concept, you will see it, at least conceptually, in other systems: in Raft, in ZooKeeper, in PostgreSQL replication.</p>
<h3>Definitions First</h3>
<p>Three terms are essential to understanding the HWM in Kafka.</p>
<p><strong>Log End Offset (LEO)</strong> is simply the offset of the next message to be written. Think of it as the tip of the log. If a broker has messages at offsets 0, 1, 2, 3, 4, its LEO is 5.</p>
<p><strong>High-Watermark (HWM)</strong> marks the replication boundary: every message at an offset below the HWM has been fully replicated to every broker in the current in-sync replica set. Consumers can only read up to this point, never beyond it.</p>
<p><strong>In-Sync Replica (ISR)</strong> is the set of brokers (leader + followers) that are caught up to the leader. A follower falls out of the ISR if it stops fetching messages fast enough, controlled by the config <code>replica.lag.time.max.ms</code> (default: 30 seconds).</p>
<p>The most important relationship to remember: <code>HWM&lt;=LEO</code>, always. The leader's LEO moves ahead with every write. The HWM moves ahead only when all ISR members confirm they have replicated the latest data.</p>
<h3>A Concrete Walk-Through</h3>
<p>Let's use a simple setup through this post:</p>
<ul>
<li><p><strong>Replication factor=3</strong>: one leader (Broker-1) and two followers (Broker-2, Broker-3).</p>
</li>
<li><p><strong>ISR={Broker-1, Broker-2, Broker-3}</strong>, all three are in sync.</p>
</li>
<li><p><code>acks=1</code> The producer gets an acknowledgment as soon as the leader writes locally.</p>
</li>
</ul>
<ol>
<li><strong>The Starting State</strong></li>
</ol>
<p>Everything is caught up. All three brokers have messages A through E.</p>
<pre><code class="language-plaintext">Offset:          0    1    2    3    4
                [A]  [B]  [C]  [D]  [E]

Broker-1  (L):  [A]  [B]  [C]  [D]  [E]   LEO=5, HWM=5
Broker-2  (F):  [A]  [B]  [C]  [D]  [E]   LEO=5
Broker-3  (F):  [A]  [B]  [C]  [D]  [E]   LEO=5

Consumer can read: A, B, C, D, E (offsets 0 through 4)
</code></pre>
<p>HWM = 5 means "everything up to offset 4 is safe." Consumers are happy.</p>
<ol start="2">
<li><strong>A New Message Arrives: Producer Writes F</strong></li>
</ol>
<p>The producer sends message F. With <code>acks=1</code>, the leader writes it locally and immediately sends back an acknowledgment to the producer. Job done from the producer's perspective.</p>
<pre><code class="language-plaintext">Offset:          0    1    2    3    4    5
                [A]  [B]  [C]  [D]  [E]  [F]

Broker-1  (L):  [A]  [B]  [C]  [D]  [E]  [F]     LEO=6, HWM=5 ← still 5!
Broker-2  (F):  [A]  [B]  [C]  [D]  [E]          LEO=5
Broker-3  (F):  [A]  [B]  [C]  [D]  [E]          LEO=5

Producer: Got acknowledgment. Thinks F is written.
Consumer can read: A, B, C, D, E but F is NOT visible yet.
</code></pre>
<p>Notice something important here: the producer got its ack, but <strong>the HWM is still 5</strong>. Consumers still can't see F.</p>
<p>This is the key insight: <code>acks</code> and HWM are completely independent mechanisms. <code>acks</code> controls when the producer hears back. HWM controls when consumers are allowed to read. The leader knows it can't move the HWM forward yet because Broker-2 and Broker-3 haven't fetched F.</p>
<ol start="3">
<li><strong>Replication Happens — But Not Evenly</strong></li>
</ol>
<p>Followers in Kafka continuously send fetch requests to the leader, pulling new messages. Let's say Broker-2 is quick and fetches F right away. Broker-3 is slightly behind—maybe it had a brief GC pause or just a slower network.</p>
<pre><code class="language-plaintext">Offset:          0    1    2    3    4    5
                [A]  [B]  [C]  [D]  [E]  [F]

Broker-1  (L):  [A]  [B]  [C]  [D]  [E]  [F]   LEO=6, HWM=5
Broker-2  (F):  [A]  [B]  [C]  [D]  [E]  [F]   LEO=6
Broker-3  (F):  [A]  [B]  [C]  [D]  [E]        LEO=5
                                          ↑
                              HWM stays at 5.
Broker-3 is still in ISR and hasn't fetched F yet.

Consumer can read: A through E. F is still blocked.
</code></pre>
<p><strong>How does the leader know about the last message offset in Broker-3?</strong> When every follower sends a fetch request to the leader, it carries the offset the follower wants next. When Broker-3 sends <code>FetchRequest(offset=5)</code>, the leader knows it has everything up to offset 4. When Broker-2 sends <code>FetchRequest(offset=6)</code>, the leader knows Broker-2 has message F. The leader tracks the minimum across all ISR members, and that's the HWM.</p>
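<p>This bookkeeping is easy to model. Here is a minimal Java sketch (a simplified model of the leader-side logic, not Kafka's actual code): treat each ISR member's next-fetch offset as its LEO, and take the minimum.</p>

```java
import java.util.Map;

public class HighWatermark {
    // Simplified model: each ISR member's next-fetch offset equals its LEO.
    // The leader's HWM is the minimum LEO across the ISR (leader included).
    static long computeHwm(Map<String, Long> isrLeos) {
        return isrLeos.values().stream().mapToLong(Long::longValue).min().orElse(0L);
    }

    public static void main(String[] args) {
        // State from the walk-through: Broker-1 and Broker-2 have F (LEO=6),
        // Broker-3 has not fetched it yet (LEO=5).
        Map<String, Long> isr = Map.of("Broker-1", 6L, "Broker-2", 6L, "Broker-3", 5L);
        System.out.println("HWM = " + computeHwm(isr)); // prints: HWM = 5
    }
}
```

<p>Run against the state above (Broker-3 still at LEO=5), the minimum, and therefore the HWM, stays at 5 until Broker-3 catches up.</p>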
<ol start="4">
<li><strong>Broker-3 Catches up → HWM Finally Advances</strong></li>
</ol>
<p>Broker-3 fetches F. Now all three ISR members have it. The leader advances the HWM to 6.</p>
<pre><code class="language-plaintext">Offset:          0    1    2    3    4    5
                [A]  [B]  [C]  [D]  [E]  [F]

Broker-1  (L):  [A]  [B]  [C]  [D]  [E]  [F]   LEO=6, HWM=6
Broker-2  (F):  [A]  [B]  [C]  [D]  [E]  [F]   LEO=6
Broker-3  (F):  [A]  [B]  [C]  [D]  [E]  [F]   LEO=6

Consumer can now read: A, B, C, D, E, F
</code></pre>
<p>F is now a committed message. Every ISR member has it. Even if the leader crashes right now, any follower that becomes the new leader will have F, and no data is lost.</p>
<p><strong>So What Happens When the Leader Crashes Mid-Replication?</strong></p>
<p>Let's go back to the state where Broker-3 hadn't yet fetched F:</p>
<pre><code class="language-plaintext">Broker-1  (L):  [A]  [B]  [C]  [D]  [E]  [F]   ← crashes right here
Broker-2  (F):  [A]  [B]  [C]  [D]  [E]  [F]
Broker-3  (F):  [A]  [B]  [C]  [D]  [E]
                                         HWM = 5
</code></pre>
<p>Kafka now needs to elect a new leader from the ISR. Either Broker-2 or Broker-3 can win. Let's say Broker-3 wins the election.</p>
<pre><code class="language-plaintext">Broker-3 (New L): [A]  [B]  [C]  [D]  [E]   ← only has up to E. F is gone.
Broker-2  (F):    [A]  [B]  [C]  [D]  [E]  [F]  ← has F, but must truncate to match the new leader

New HWM = 5. F is permanently lost.
</code></pre>
<p>Broker-2 will truncate its own log to match the new leader, dropping F. The message is gone.</p>
<p><strong>But here's the important part</strong>: no consumer ever read F. It was above the HWM the entire time. From the consumer's perspective, nothing unusual happened. The system is consistent.</p>
<p>This is the trade-off you accept with <code>acks=1</code>. The producer got an ack, but the system made no durability promise. If you want F to survive any single broker failure, you need <code>acks=all</code>.</p>
<p>Now, let's say Broker-2 wins the election instead.</p>
<pre><code class="language-plaintext">Broker-2 (New L): [A]  [B]  [C]  [D]  [E]  [F] ← has F, becomes leader
Broker-3  (F):    [A]  [B]  [C]  [D]  [E]      ← catches up and fetches F

HWM advances to 6 once Broker-3 fetches F. F survives.
</code></pre>
<p>Whether F survives depends on which ISR member wins the election. This non-determinism is exactly why <code>acks=1</code> provides weaker durability. The producer receives an acknowledgment before replication, so a leader crash can still cause data loss.</p>
<h3>The <code>acks</code> Settings Compared</h3>
<p>Let's make this concrete with all three options:</p>
<p><code>acks=0</code>: Producer sends and moves on. No acknowledgment at all. Fastest, but any broker issue can lead to loss of data. Good for metrics or logs where occasional loss is acceptable.</p>
<p><code>acks=1</code>: Leader writes locally and acknowledges. Followers replicate in the background. If the leader crashes before replication, the message is lost. HWM still protects consumers from reading uncommitted data.</p>
<p><code>acks=all</code> (or <code>-1</code>) with <code>min.insync.replicas=2</code>: Leader waits until at least 2 ISR members (itself included) have written the message before acknowledging. F would only get an ack after the leader and at least one follower confirmed it. Slower, but your data can survive any single broker failure.</p>
<p>The combination you want for anything important is <code>acks=all + min.insync.replicas=2 + replication.factor=3 + unclean.leader.election.enable=false</code>. These together provide strong durability guarantees and are the common production configuration for surviving a single broker failure.</p>
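<p>In client code, only <code>acks</code> is a producer setting; the other three are topic- or broker-level configs. A minimal sketch of the producer side (the bootstrap address is a placeholder; the config key names are Kafka's standard ones):</p>

```java
import java.util.Properties;

public class DurableProducerConfig {
    // Producer-side settings for the durable configuration discussed above.
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092"); // placeholder address
        props.put("acks", "all"); // ack only after min.insync.replicas brokers have the write
        return props;
    }

    public static void main(String[] args) {
        // Topic/broker-side (set at topic creation or in server.properties, not here):
        //   replication.factor=3, min.insync.replicas=2,
        //   unclean.leader.election.enable=false
        System.out.println(producerProps().getProperty("acks")); // prints: all
    }
}
```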
<p>Remember, Kafka's durability guarantee is based on <strong>HWM</strong>, not producer ACK. This means a producer can receive an acknowledgment for a message that is still above the HWM (<code>LEO &gt; HWM</code>) and therefore not yet fully committed.</p>
<h3>Summary</h3>
<p>The High-Watermark is Kafka's promise to consumers: "<em>Everything you read has been replicated to every broker in the ISR — you will never read a message the cluster can't recover.</em>"</p>
<p>The LEO always moves ahead. The HWM follows carefully behind. And the gap between them — that narrow window of messages written but not yet fully replicated — is the territory Kafka keeps hidden from consumers until it's safe.</p>
<p>Understanding this precisely gives you the ability to reason about <code>acks</code> settings, ISR configuration, and replication lag in a way that goes beyond config documentation. Understanding this concept well also ensures that, in a production Kafka cluster, you know what to look for when the producer gets an acknowledgement from the leader, but the consumers don't see the message.</p>
<h3><strong>Other Articles You May Like</strong></h3>
<ul>
<li><p>Dive deeper into the write-ahead log in <a href="https://bytefreak.dev/understanding-write-ahead-logs-durability-beyond-the-flush">WAL</a></p>
</li>
<li><p>Understand the Bloom filter in <a href="https://bytefreak.dev/bloom-filter">Bloom Filter: Definitely No, Probably Yes</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[When Bloom Filters Fail: False Positives, Memory Trade-offs, Production Lessons]]></title><description><![CDATA[In my previous article, Bloom Filter: Definitely No, Probably Yes, we saw that a Bloom filter acts like a ‘magic’ toolbox to perform quick operations on large datasets to determine whether a value is certainly not in the set. However, this 'definitel...]]></description><link>https://bytefreak.dev/when-bloom-filters-fail</link><guid isPermaLink="true">https://bytefreak.dev/when-bloom-filters-fail</guid><category><![CDATA[distributed system]]></category><category><![CDATA[performance]]></category><category><![CDATA[System Design]]></category><category><![CDATA[Databases]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Subhashish (Subh) Bhattacharjee]]></dc:creator><pubDate>Wed, 11 Feb 2026 04:14:24 GMT</pubDate><content:encoded><![CDATA[<p>In my previous article, <a target="_blank" href="https://bytefreak.dev/bloom-filter">Bloom Filter: Definitely No, Probably Yes</a>, we saw that a Bloom filter acts like a ‘magic’ toolbox to perform quick operations on large datasets to determine whether a value is certainly not in the set. However, this 'definitely no, probably yes' nature, while enabling optimization, can become a silent killer if not designed with growth and observability in mind.</p>
<h3 id="heading-false-positives-dont-break-correctnessthey-shift-the-load">False positives don’t break correctness—they shift the load</h3>
<p>A Bloom filter never returns false negatives, which makes it feel safe. But false positives still matter.</p>
<p>Imagine that you have implemented a cache-penetration Bloom filter sized for 100M user IDs at a 1% false positive (FP) rate. The following year, the user base grows to 500M. The FP rate degrades dramatically, to the point where a large share of IDs now hit Redis and the DB anyway, negating the value of the Bloom filter.</p>
<p>So, false positives don’t cause failures; they just shift the load to the downstream systems, which effectively negates the benefit of the filter.</p>
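<p>You can quantify this degradation with the standard false-positive approximation <code>p ≈ (1 - e^(-kn/m))^k</code>. A small Java sketch using the textbook sizing formulas (the exact degraded rate depends on how the filter was originally sized; for a filter sized optimally for the original load, it is severe):</p>

```java
public class BloomFpRate {
    // Standard false-positive approximation: p ≈ (1 - e^(-k*n/m))^k
    static double fpRate(long m, int k, long n) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    // Textbook optimal sizing: m = -n*ln(p)/(ln 2)^2, k = (m/n)*ln 2
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }
    static int optimalHashes(long n, long m) {
        return (int) Math.round((double) m / n * Math.log(2));
    }

    public static void main(String[] args) {
        long m = optimalBits(100_000_000L, 0.01); // sized for 100M IDs at 1%
        int k = optimalHashes(100_000_000L, m);
        System.out.printf("at 100M: p = %.4f%n", fpRate(m, k, 100_000_000L)); // ≈ 0.01
        System.out.printf("at 500M: p = %.4f%n", fpRate(m, k, 500_000_000L)); // far above 0.01
    }
}
```

<p>With the optimal <code>m</code> and <code>k</code> for 100M items at 1%, the same filter holding 500M items answers “maybe” for the large majority of absent keys.</p>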
<h3 id="heading-memory-sizing-decisions-age-poorly">Memory sizing decisions age poorly</h3>
<p>Bloom filters are typically sized for the data volume at deployment time. As the data grows past that, the false-positive rate climbs and the filter's efficacy drops.</p>
<p>Because there are no errors or alerts, the degradation goes unnoticed. A filter that seems sufficient today doesn't fail as data grows—it just becomes less effective, silently shifting the load downstream without raising alarms.</p>
<p>Hence, a Bloom filter's job doesn’t end with deployment to production; it requires constant monitoring and resizing as the data grows.</p>
<h3 id="heading-more-hash-functions-arent-always-a-fix">More hash functions aren’t always a fix</h3>
<p>When false positives increase, it’s tempting to add more hash functions. It seems an obvious fix, but we need to be aware of the impact on overall latency.</p>
<p>Each additional hash function increases CPU work and touches more memory. In a latency-sensitive path, this can increase the tail latency (p95/p99) even if false positives drop significantly. Research on LSM-trees shows that with fast storage (NVMs), Bloom filter hashing can dominate query latency and can become a significant bottleneck as key sizes grow.</p>
<p>Hence, increasing the hash functions can help up to a certain point, and performance needs to be measured based on the system’s latency requirements.</p>
<h3 id="heading-observability-matters-more-than-configuration">Observability matters more than configuration</h3>
<p>As you may have understood by now, a Bloom filter in a production use case is not effective without good observability. A few important metrics:</p>
<ul>
<li><p>How often does the filter return positives?</p>
<p>  <code>bloom_positives_total / bloom_checks_total</code></p>
</li>
<li><p>What fraction of those positives are false (wasted downstream calls)?</p>
<p>  <code>(bloom_positives_total - bloom_true_positives) / bloom_positives_total</code></p>
</li>
<li><p>Does it still reduce load over time?</p>
<p>  <code>(db_calls_without_filter - db_calls_with_filter) / db_calls_without_filter</code></p>
</li>
</ul>
<h3 id="heading-when-bloom-filters-may-be-the-wrong-choice">When Bloom Filters may be the wrong choice</h3>
<ul>
<li><p>False positives are expensive or unacceptable: each false positive triggers a full cache + DB check. If your downstream system can’t handle the extra load, don’t use a Bloom filter.</p>
</li>
<li><p>The check lies in a latency-critical path: hashing overhead (especially with many hash functions) adds microseconds that matter in p99-sensitive APIs.</p>
</li>
<li><p>The data set is small enough: if your dataset fits in a few MB of Redis or can be indexed efficiently in Postgres or a similar system, a Bloom filter is overkill.</p>
</li>
</ul>
<h2 id="heading-summary">Summary</h2>
<p>Bloom filters are powerful tools, but they are not set-and-forget optimizations.</p>
<p>In real systems, their value depends on sizing, memory trade-offs, and observability. Used well, they reduce load; used casually, they quietly move problems downstream.</p>
]]></content:encoded></item><item><title><![CDATA[Bloom Filter:  Definitely No, Probably Yes]]></title><description><![CDATA[In large-scale distributed systems, knowing what you don’t have is often more valuable than knowing what you do.
Let’s understand with a practical example. Imagine you are building a recommendation engine for a blogging site like Hashnode or Medium, ...]]></description><link>https://bytefreak.dev/bloom-filter</link><guid isPermaLink="true">https://bytefreak.dev/bloom-filter</guid><category><![CDATA[distributed system]]></category><category><![CDATA[Databases]]></category><category><![CDATA[caching]]></category><category><![CDATA[performance]]></category><dc:creator><![CDATA[Subhashish (Subh) Bhattacharjee]]></dc:creator><pubDate>Tue, 03 Feb 2026 15:32:01 GMT</pubDate><content:encoded><![CDATA[<p>In large-scale distributed systems, knowing what you <em>don’t</em> have is often more valuable than knowing what you do.</p>
<p>Let’s understand with a practical example. Imagine you are building a recommendation engine for a blogging site like Hashnode or Medium, where users read blog posts. You want to show users fresh content they will love, and you certainly don’t want to recommend <em>articles they have already read.</em></p>
<p>You could query the primary database to fetch the user’s history and filter out already-seen articles. However, for a large-scale blog site, hitting the DB for every recommendation would be a waste of I/O. You can store this information (userId → list of articleIds) in a Redis cache. But let’s look at the cost of keeping such a large dataset in memory:</p>
<p>Let’s do a back-of-the-envelope calculation:</p>
<ul>
<li><p><strong>100 Million Users</strong></p>
</li>
<li><p>Suppose each user has read <strong>100 articles (on average)</strong>.</p>
</li>
<li><p>Each articleId is a 64-bit (8-byte) identifier.</p>
</li>
</ul>
<p><code>100M users * 100 articles * 8 bytes = 80 GB</code> (ignoring overhead)</p>
<p>Though it’s manageable at this stage, it will only grow with time. This is where the <strong>Bloom Filter</strong> shines. It is a space-efficient <em>probabilistic</em> data structure that answers one question extremely fast: <em>“Is this element in the set?”</em> and it does so without storing the actual element. Lookups are fast, and the filter occupies roughly an order of magnitude less memory than storing all the IDs directly.</p>
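<p>How much less? The textbook sizing formula <code>m = -n * ln(p) / (ln 2)^2</code> works out to roughly 9.6 bits per element at a 1% false-positive rate, regardless of how large each ID is. A back-of-the-envelope sketch in Java:</p>

```java
public class BloomSizing {
    // Bits needed for n elements at false-positive rate p: m = -n*ln(p)/(ln 2)^2
    static double bitsNeeded(double n, double p) {
        return -n * Math.log(p) / (Math.log(2) * Math.log(2));
    }

    public static void main(String[] args) {
        double n = 100_000_000.0 * 100; // 100M users x 100 read articles = 10B entries
        double rawBytes = n * 8;                     // 8-byte IDs stored directly
        double bloomBytes = bitsNeeded(n, 0.01) / 8; // 1% false-positive target
        System.out.printf("raw: %.0f GB, bloom: %.0f GB%n",
                rawBytes / 1e9, bloomBytes / 1e9); // ≈ 80 GB vs ≈ 12 GB
    }
}
```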
<p>The outcome is either of the two:</p>
<ol>
<li><p><strong>No</strong> (it is <em>definitely</em> not in the set).</p>
</li>
<li><p><strong>Maybe yes</strong> (it might be in the set or be a false positive).</p>
</li>
</ol>
<h2 id="heading-why-use-bloom-filter">Why Use Bloom Filter?</h2>
<p>Beyond saving memory in recommendation engines, Bloom filters can solve a critical security issue in backend systems: <strong>cache penetration.</strong></p>
<p>Let’s consider a standard <strong>Cache-Aside</strong> pattern:</p>
<ol>
<li><p>App requests data for <code>Key X</code></p>
</li>
<li><p>Cache Miss (Key X not found).</p>
</li>
<li><p>The app queries the database.</p>
</li>
<li><p>The app then populates the cache.</p>
</li>
</ol>
<p>Now, imagine an attacker (or a bug in the code) spams your API with random, nonexistent UUIDs.</p>
<ul>
<li><p>The cache misses (because the keys don't exist).</p>
</li>
<li><p>Every single request goes to the database.</p>
</li>
<li><p>The database doesn’t find it, so it returns “Not Found.”</p>
</li>
</ul>
<p>If the volume is high enough, this flood of invalid requests can bring down the primary database.</p>
<p>You add a Bloom Filter in front of the cache. If the filter returns “No,” you trust it immediately and return <code>404 Not Found</code>. You don’t touch the cache, and you <em>certainly don’t touch the database</em>.</p>
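<p>The guarded lookup path can be sketched as below. The filter, cache, and DB layers are passed in as plain functions here; they are hypothetical stand-ins, and a real service would wire in its own clients (and would also populate the cache on a DB hit, omitted for brevity):</p>

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

public class PenetrationGuard {
    // Hypothetical wiring: mightContain is the Bloom filter check,
    // cacheLookup and dbLookup are the usual cache-aside layers.
    static Optional<String> lookup(String key,
                                   Predicate<String> mightContain,
                                   Function<String, Optional<String>> cacheLookup,
                                   Function<String, Optional<String>> dbLookup) {
        if (!mightContain.test(key)) {
            return Optional.empty(); // "definitely no" -> 404; no cache, no DB
        }
        return cacheLookup.apply(key)
                .or(() -> dbLookup.apply(key)); // "maybe" -> normal cache-aside path
    }

    public static void main(String[] args) {
        Optional<String> r = lookup("ghost-key",
                k -> false,                    // filter says definitely absent
                k -> { throw new AssertionError("cache should not be hit"); },
                k -> { throw new AssertionError("DB should not be hit"); });
        System.out.println(r.isEmpty()); // prints: true
    }
}
```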
<h2 id="heading-how-it-works-the-bit-array">How It Works: The Bit Array</h2>
<p>Under the hood, a Bloom Filter doesn’t store the actual data; it stores 1s and 0s in a bit array to mark the presence of the value under consideration (a userId or userName, etc.). It has two main parts:</p>
<ol>
<li><p>An array of <code>m</code> bits, all initialized to <code>0</code>.</p>
</li>
<li><p><code>h</code> different hash functions.</p>
</li>
</ol>
<p><strong>Write Path:</strong><br />A simple example: add users to the Bloom filter so we can later check whether a given user exists in the system. Let’s add the user <code>cathy</code>. We pass her name through our <code>h=3</code> hash functions.</p>
<ul>
<li><p><code>h1("cathy") % m = 1</code></p>
</li>
<li><p><code>h2("cathy") % m = 3</code></p>
</li>
<li><p><code>h3("cathy") % m = 6</code></p>
</li>
</ul>
<p>We set those indices in our array to <strong>1</strong> (as shown in the figure below). That’s it.</p>
<p><strong>Read Path:</strong><br />Now, let’s check if <code>cathy</code> exists. We run the same hashes and check whether the values at indices 1, 3, and 6 are all <code>1</code>. If they are, she <em>probably</em> exists.</p>
<p>Now, let's check <code>bob</code>, who was never added.</p>
<ul>
<li><p><code>h1("bob") % m</code> points to <strong>1</strong> (the value at index 1 is <code>1</code> because of <code>cathy</code>).</p>
</li>
<li><p><code>h2("bob") % m</code> points to <strong>7</strong> (the value at index 7 is <code>0</code>).</p>
</li>
<li><p><code>h3("bob") % m</code> points to <strong>6</strong> (the value at index 6 is <code>1</code> because of <code>cathy</code>).</p>
</li>
</ul>
<p>Because the value at index 7 is <code>0</code>, we know for a fact that <code>bob</code> has never been added. We stop, and the result is <strong>“Definitely No.”</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769155950453/63efbc6e-cb1b-4a29-be09-050726e3323a.png" alt class="image--center mx-auto" /></p>
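<p>The whole mechanism fits in a few lines of Java. This sketch uses <code>BitSet</code> and derives its <code>k</code> indices (called <code>h</code> above) by double hashing on <code>String.hashCode()</code>, a common trick for toy examples rather than what production libraries do:</p>

```java
import java.util.BitSet;

public class TinyBloom {
    private final BitSet bits;
    private final int m, k;

    TinyBloom(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

    // Double hashing: derive k indices from two base hashes.
    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = h1 * 31 + 17; // crude second hash, fine for a sketch
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(String s) { for (int i = 0; i < k; i++) bits.set(index(s, i)); }

    boolean mightContain(String s) {
        for (int i = 0; i < k; i++)
            if (!bits.get(index(s, i))) return false; // definitely no
        return true; // probably yes
    }

    public static void main(String[] args) {
        TinyBloom f = new TinyBloom(64, 3);
        f.add("cathy");
        System.out.println(f.mightContain("cathy")); // true
        System.out.println(f.mightContain("bob"));   // false for this input,
        // but with other inputs it could have been a false positive
    }
}
```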
<h2 id="heading-the-false-positive-case">The “False Positive” Case</h2>
<p>So why did I say "Probably Yes" in the title?</p>
<p>Imagine we keep adding users until almost every bit in the array is flipped to <code>1</code>. Eventually, we might check for a user named <code>jack</code>. By sheer coincidence, his hash values land on indices that are already set to <code>1</code> by <em>other</em> users.</p>
<ul>
<li><p>The filter sees all 1s.</p>
</li>
<li><p>It tells you, “Maybe Jack exists.”</p>
</li>
<li><p>But in reality, he doesn’t.</p>
</li>
</ul>
<p>This is a <strong>false positive.</strong> It’s the trade-off we make for speed and a small memory footprint. We can reduce these errors by making the array larger, using more hash functions (up to a point), or both, but we can never eliminate them entirely.</p>
<p>Standard Bloom filters don’t support deleting values (though deletion can be achieved with a <em>Counting Bloom Filter</em>). Once a bit is flipped to <code>1</code>, it stays <code>1</code>, because there is no way to know whether that bit belongs to <code>cathy</code> or <code>bob</code>. This makes Bloom filters a great fit for read-heavy systems, or data sets that are effectively append-only.</p>
<h2 id="heading-real-world-use-cases">Real-World Use Cases</h2>
<ol>
<li><p>As discussed, it can be used in cache penetration protection for requests with non-existent keys to not hammer the database every time there is a cache miss.</p>
</li>
<li><p>Used in web crawlers to check whether a page has already been crawled, avoiding redundant crawls.</p>
</li>
<li><p>Used in LSM-tree-based systems such as Cassandra and HBase. In Cassandra, data is first written to the memtable (an in-memory structure) and periodically flushed to disk as SSTables. Every SSTable maintains its own Bloom filter. When you query data, the DB checks the memtable first; if the key isn’t there, it consults each SSTable’s Bloom filter (newest → oldest). If a filter returns ‘definitely not,’ that SSTable’s index and data files are skipped entirely and the search moves on to the next SSTable. This avoids unnecessary disk reads and drastically improves read performance.</p>
</li>
</ol>
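<p>The Cassandra-style read path in the last point can be sketched as follows. The <code>SSTable</code> interface here is a hypothetical stand-in, not Cassandra's API; it just separates the cheap in-memory filter check from the expensive disk read:</p>

```java
import java.util.List;
import java.util.Optional;

public class SSTableLookup {
    // Hypothetical SSTable model: a Bloom filter guard plus an (expensive) disk read.
    interface SSTable {
        boolean mightContain(String key);           // Bloom filter check, in memory
        Optional<String> readFromDisk(String key);  // index + data file access
    }

    // Check memtable first (omitted here), then SSTables newest -> oldest,
    // skipping any whose Bloom filter says "definitely no".
    static Optional<String> lookup(String key, List<SSTable> newestFirst) {
        for (SSTable t : newestFirst) {
            if (!t.mightContain(key)) continue; // skip disk entirely
            Optional<String> v = t.readFromDisk(key);
            if (v.isPresent()) return v;        // empty here means false positive: keep going
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        SSTable miss = new SSTable() {
            public boolean mightContain(String k) { return false; }
            public Optional<String> readFromDisk(String k) {
                throw new AssertionError("disk read should have been skipped");
            }
        };
        SSTable hit = new SSTable() {
            public boolean mightContain(String k) { return true; }
            public Optional<String> readFromDisk(String k) { return Optional.of("v1"); }
        };
        System.out.println(lookup("row-42", List.of(miss, hit)).get()); // prints: v1
    }
}
```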
<h2 id="heading-summary">Summary</h2>
<p>Bloom filters are one of those “magic” tools that let you reject invalid requests instantly with a low memory footprint.</p>
<ul>
<li><p>No → Definitely not in the set (Trust 100%).</p>
</li>
<li><p>Maybe → May be in the set (check the DB or downstream system to confirm).</p>
</li>
</ul>
<h2 id="heading-other-articles-you-may-like"><strong>Other Articles You May Like</strong></h2>
<ul>
<li>Dive deeper into Consistent Hashing in <a target="_blank" href="https://hashnode.com/post/cmb7zwzgh001809jvhrte3x7b"><strong>From Modulo to Consistent Hashing</strong></a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Concurrency vs. Parallelism: A Coffee Shop Guide for Developers]]></title><description><![CDATA[If you ask ten developers to explain the difference between concurrency and parallelism, you might get ten slightly different answers. It’s one of those fundamental concepts that is easy to grasp abstractly but tricky to visualize in practice.
To und...]]></description><link>https://bytefreak.dev/concurrency-vs-parallelism-for-developers</link><guid isPermaLink="true">https://bytefreak.dev/concurrency-vs-parallelism-for-developers</guid><category><![CDATA[concurrency]]></category><category><![CDATA[Java]]></category><dc:creator><![CDATA[Subhashish (Subh) Bhattacharjee]]></dc:creator><pubDate>Thu, 22 Jan 2026 17:06:16 GMT</pubDate><content:encoded><![CDATA[<p>If you ask ten developers to explain the difference between concurrency and parallelism, you might get ten slightly different answers. It’s one of those fundamental concepts that is easy to grasp abstractly but tricky to visualize in practice.</p>
<p>To understand where we are today, we have to look at where we started.</p>
<h2 id="heading-the-single-core-era-vs-the-multi-core-revolution">The Single-Core Era vs. The Multi-Core Revolution</h2>
<p>Back in the “old days” of computing, we relied on single-core processors. Despite this limitation, computers still seemed to multitask. You could listen to music while typing a document, and it felt simultaneous. But it was an illusion.</p>
<p>The processor was frantically switching between tasks—giving a few milliseconds to the music player, then a few milliseconds to the word processor—so quickly that we humans couldn’t notice the gap. This is the foundation of <strong>threading</strong>.</p>
<p>Today, we have multi-core processors (dual-core, quad-core, octa-core, etc.) that can physically execute multiple instructions at the exact same instant. However, to utilize that power, we first need to design our software correctly.</p>
<h2 id="heading-defining-the-terms">Defining the Terms</h2>
<ul>
<li><p><strong>Concurrency</strong> is about <strong>structure</strong>. It is the composition of a program into small, independent tasks that <em>can</em> be executed out of order or in partial order.</p>
</li>
<li><p><strong>Parallelism</strong> is about <strong>execution</strong>. It is the simultaneous execution of distinct tasks.</p>
</li>
</ul>
<p>You can have concurrency without parallelism (the single-core example), but you generally cannot have parallelism without concurrency.</p>
<h2 id="heading-the-context-switch">The “Context Switch”</h2>
<p>In a concurrent system, the CPU has to save the state of the current thread (variables, instruction pointers) and load the state of the next thread. This is called a <strong>Context Switch</strong>.</p>
<p>Context switching is necessary for responsiveness, but it isn’t free. If you have a single processor and you spin up 1,000 threads, your computer might spend more time switching between them than actually doing the work!</p>
<h2 id="heading-the-coffee-shop-analogy">The Coffee Shop Analogy</h2>
<p>Let’s visualize this with a simple office breakroom scenario.</p>
<p><strong>Scenario A: Concurrent but NOT Parallel</strong><br />Imagine an office breakroom with <strong>two lines</strong> of developers but only <strong>one coffee machine</strong>.</p>
<ul>
<li><p>The developers are independent “tasks”.</p>
</li>
<li><p>The queues represent the structure (Concurrency).</p>
</li>
<li><p>The coffee machine is the CPU.</p>
</li>
</ul>
<p>Even though there are two lines, the coffee machine can only brew one cup at a time. It might serve the first person in Line A, then switch to the first person in Line B. This is concurrency. The tasks are progressing, and no single line is completely blocked, but they are sharing the same resource.</p>
<p><strong>Scenario B: Concurrent AND Parallel</strong><br />Now, management buys a <strong>second coffee machine</strong>.</p>
<ul>
<li><p>We still have the two lines (Concurrency).</p>
</li>
<li><p>But now, the person in Line A and the person in Line B can press “Brew” at the same instant.</p>
</li>
</ul>
<p>This is parallelism. Because we structured the problem correctly (separate queues), adding more hardware (the second coffee machine) instantly doubled our throughput. Let’s understand with a diagram.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769079403213/b8b870a8-23b6-4b71-9e73-6158642308ba.png" alt class="image--center mx-auto" /></p>
<p>An example code in Java should clarify further.</p>
<pre><code class="lang-java"><span class="hljs-keyword">import</span> java.util.concurrent.ExecutorService;
<span class="hljs-keyword">import</span> java.util.concurrent.Executors;

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">CoffeeShop</span> </span>{
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">main</span><span class="hljs-params">(String[] args)</span> </span>{
        <span class="hljs-comment">// Scenario 2: Parallelism (2 Coffee Machines -&gt; 2 Threads)</span>
        ExecutorService coffeeMachines = Executors.newFixedThreadPool(<span class="hljs-number">2</span>);

        Runnable makeCoffee = () -&gt; {
            String threadName = Thread.currentThread().getName();
            System.out.println(threadName + <span class="hljs-string">" is brewing coffee..."</span>);
            <span class="hljs-keyword">try</span> {
                Thread.sleep(<span class="hljs-number">2000</span>); <span class="hljs-comment">// Simulate brewing</span>
            } <span class="hljs-keyword">catch</span> (InterruptedException e) {
                Thread.currentThread().interrupt(); <span class="hljs-comment">// Restore the interrupt flag</span>
            }
            System.out.println(threadName + <span class="hljs-string">" is finished!"</span>);
        };

        <span class="hljs-comment">// Two people order at the same time</span>
        coffeeMachines.submit(makeCoffee);
        coffeeMachines.submit(makeCoffee);

        coffeeMachines.shutdown();
    }
}
</code></pre>
<p>If you run this, both “brewing” messages appear instantly. If you change the thread pool to <code>1</code> (concurrency), the second message would only appear after the first one finishes.</p>
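<p>To see the difference as wall-clock time rather than message ordering, here is a small timing sketch. The pool sizes and sleep durations are arbitrary illustration values, not anything prescribed by the executor API:</p>

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BrewTiming {
    // Runs two 500 ms "brews" on a pool of the given size and returns elapsed millis
    static long runWith(int machines) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(machines);
        Runnable brew = () -> {
            try {
                Thread.sleep(500); // simulate brewing one cup
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        long start = System.nanoTime();
        pool.submit(brew);
        pool.submit(brew);
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("1 machine:  ~" + runWith(1) + " ms"); // sequential: about 1000 ms
        System.out.println("2 machines: ~" + runWith(2) + " ms"); // parallel: about 500 ms
    }
}
```

<p>With one thread the two brews run back to back; with two threads they overlap, roughly halving the total time.</p>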
<h2 id="heading-summary">Summary</h2>
<p>Designing for concurrency means structuring your program so that tasks don’t rely on each other unnecessarily. If you write a program where every step must happen sequentially (Line A must finish before Line B starts), you can never parallelize it, no matter how many cores your system has.</p>
<p><strong>Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once.</strong></p>
<h2 id="heading-other-articles-you-may-like">Other Articles You May Like</h2>
<ul>
<li>Dive deeper into Consistent Hashing in <a target="_blank" href="https://hashnode.com/post/cmb7zwzgh001809jvhrte3x7b">From Modulo to Consistent Hashing</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[From Modulo to Consistent Hashing: Optimizing Distributed Storage]]></title><description><![CDATA[🔥 Ever tried scaling a single database past its limits? You’ll quickly encounter massive rebalance storms and downtime. While a single-server setup might handle initial workloads easily, expanding to tens or hundreds of millions of users demands dis...]]></description><link>https://bytefreak.dev/from-modulo-to-consistent-hashing-optimizing-distributed-storage</link><guid isPermaLink="true">https://bytefreak.dev/from-modulo-to-consistent-hashing-optimizing-distributed-storage</guid><category><![CDATA[distributed systems]]></category><category><![CDATA[consistent hashing]]></category><category><![CDATA[System Design]]></category><category><![CDATA[scalability]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Subhashish (Subh) Bhattacharjee]]></dc:creator><pubDate>Wed, 28 May 2025 13:43:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/pFJtmoDMSAo/upload/43c0c0873dc1c2ae2407a3f26a2e5105.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>🔥 <strong>Ever tried scaling a single database past its limits</strong>? You’ll quickly encounter massive rebalance storms and downtime. While a single-server setup might handle initial workloads easily, expanding to tens or hundreds of millions of users demands distributed storage, bringing unique challenges to data management.</p>
<h2 id="heading-design-goals-for-distributed-systems">⚙️ Design Goals for Distributed Systems</h2>
<ul>
<li><p><strong>Uniform distribution:</strong> Avoid hotspots by evenly spreading data across nodes.</p>
</li>
<li><p><strong>High throughput:</strong> Scale horizontally for fast reads and writes.</p>
</li>
<li><p><strong>Elasticity:</strong> Add or remove nodes without disrupting service.</p>
</li>
<li><p><strong>Resilience:</strong> Handle node failures, network partitions, and unpredictable workloads.</p>
</li>
</ul>
<p>To achieve this, we need a mechanism that quickly determines data placement and minimizes shard rebalancing during cluster changes. Let's explore the need for consistent hashing with an illustrative example.</p>
<h2 id="heading-use-case-shopping-cart-service-in-a-global-e-commerce-site">🛒 Use Case: Shopping Cart Service in a Global E-Commerce Site</h2>
<p>Imagine you are building the shopping cart service for a global e-commerce site. Initially, with just a few thousand users, everything is simple and fits into a single node, as illustrated in <em>Figure 1</em>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748355556754/9d0860bd-0a42-474f-995a-865e5027366f.png" alt class="image--center mx-auto" /></p>
<p>Your business is a success, and the user base is now growing rapidly along with the data. This rapid growth soon necessitates distributing data across multiple nodes—a process known as <strong>sharding</strong>, as shown in <em>Figure 2</em>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748404886307/9670b295-c6f3-48a8-9347-b81e1b0e396a.png" alt class="image--center mx-auto" /></p>
<p>You store the information in a fast key-value store to ensure that actions like “Add to cart” and “Checkout” are quick and responsive. Here’s how a record may look in a key-value store:</p>
<pre><code class="lang-plaintext">Key   = "&lt;user_id&gt;"
Value = 
{
  "userId": "827391",
  "items": [
    {"productId": "SKU12345", "quantity": 2, "unitPrice": 25.99}
  ],
  "lastUpdated": "2025-05-27T14:32:15Z",
  "currency": "USD"
}
</code></pre>
<p>To scale, it’s necessary to shard the data, specifically <code>user_id</code> in our case. But how does the system decide the mapping between the shard ID and the server node? In other words, how do I know where to put my data?</p>
<h3 id="heading-modulo-hashing">🔢 Modulo Hashing</h3>
<p>Hashing deterministically converts variable‑length inputs into fixed‑size numeric values. A strong hash is <strong>fast</strong>, <strong>uniformly distributes</strong> outputs to avoid hotspots, and has a <strong>low collision rate</strong>—though it’s <strong>one‑way</strong>, so you can’t reverse it to recover the original data. Good hashing sets the stage for efficient sharding.</p>
<p>Let’s start our use case with a 3-node cluster (N0, N1, N2). Modulo hashing efficiently determines data placement:</p>
<pre><code class="lang-plaintext">nodes = ["N0", "N1", "N2"]
N = len(nodes)  # 3
db_node_id = nodes[hash(user_id) % N]

# Example assignments:
# user_id=42  → hash(42) % 3 = 0 → nodes[0] → "N0"
# user_id=100 → hash(100) % 3 = 1 → nodes[1] → "N1"
# user_id=107 → hash(107) % 3 = 2 → nodes[2] → "N2"
</code></pre>
<p>It’s a super-simple lookup. The distribution of shopping cart data among the three nodes in the database cluster is depicted in <em>Figure 3</em>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748357373724/cd06b7a4-3656-46c1-addf-07cce71f7eae.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p><strong>Key Definitions</strong></p>
<ul>
<li><p><strong>Cluster</strong>: A group of server nodes working together to store and serve your data.</p>
</li>
<li><p><strong>Shard Manager</strong>: The component (often built into the database or cache) that maps each shard ID to a specific node in the cluster.</p>
</li>
</ul>
</blockquote>
<p>While modulo hashing is simple, it suffers significantly when scaling up or down.</p>
<h3 id="heading-adding-a-new-node-to-the-cluster">➕ Adding a New Node <strong>to the cluster</strong></h3>
<p>During the peak sale season, a new node is added to take up some load. When adding a 4th node, recalculations shift almost every user’s data, causing extensive rebalancing:</p>
<pre><code class="lang-plaintext">nodes = ["N0", "N1", "N2", "N3"]
N = len(nodes)  # 4
db_node_id = nodes[hash(user_id) % N]

# Example assignments:
# user_id=42  → hash(42) % 4 = 2 → nodes[2] → "N2" (shifted)
# user_id=100 → hash(100) % 4 = 0 → nodes[0] → "N0" (shifted)
# user_id=107 → hash(107) % 4 = 3 → nodes[3] → "N3" (shifted)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748357413381/b828a8be-be3c-4845-9a5c-9aa6dd152e3d.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-removing-a-node-from-the-cluster">➖ Removing a Node <strong>from the cluster</strong></h3>
<p>Say N0 is removed from the cluster. The new shard mapping is then calculated as follows:</p>
<pre><code class="lang-plaintext">nodes = ["N1", "N2", "N3"]
N = len(nodes)  # 3
db_node_id = nodes[hash(user_id) % N]

# Example assignments:
# user_id=42  → hash(42) % 3 = 0 → nodes[0] → "N1" (shifted)
# user_id=100 → hash(100) % 3 = 1 → nodes[1] → "N2" (shifted)
# user_id=107 → hash(107) % 3 = 2 → nodes[2] → "N3" (unchanged)
</code></pre>
<p>Similarly, removing a node causes widespread data movement.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748406172105/af8f8495-d4d3-4ad4-aa04-033d77364cc5.png" alt class="image--center mx-auto" /></p>
<p>Because the assignment of a user ID to a node depends on the total number of active nodes in the cluster, adding or removing a node forces a large share of the data to move. In large systems, this frequent reshuffling is inefficient, leading to downtime and performance hits.</p>
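<p>The scale of this reshuffling is easy to quantify. The sketch below is a toy simulation (10,000 sequential integer IDs standing in for already-hashed user IDs) that counts how many keys change nodes when the cluster grows from three to four:</p>

```java
public class ModuloReshuffle {
    public static void main(String[] args) {
        int total = 10_000; // hypothetical user IDs 0..9999 (already-hashed stand-ins)
        int moved = 0;
        for (int id = 0; id < total; id++) {
            int before = id % 3; // node index with 3 nodes
            int after  = id % 4; // node index after adding a 4th node
            if (before != after) moved++;
        }
        System.out.println(moved + " of " + total + " keys moved"); // prints: 7498 of 10000 keys moved
    }
}
```

<p>Roughly three-quarters of the keys move. In general, growing from N to N+1 nodes with modulo hashing leaves only about 1/(N+1) of the keys in place.</p>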
<h2 id="heading-consistent-hashing-the-scalable-cure">🌐 Consistent Hashing: The Scalable Cure</h2>
<p>Consistent hashing is a highly efficient hashing mechanism that is used in many large-scale distributed systems, such as Cassandra and DynamoDB.</p>
<p>Consistent hashing represents the whole key space as a logical ring. The ring size is fixed and independent of the cluster’s size; for ease of understanding, let’s use a ring that runs from 0 to 360 in our example. Each physical node is hashed using a hash function and placed at the corresponding position on the hash ring, as shown in <em>Figure 6</em>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748411239268/512e23b7-b9f2-4dd5-901f-ad8cb0ae30c8.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p>Just as vehicles choose the nearest exit on a roundabout, consistent hashing picks the ‘closest’ node on the ring to store each key.</p>
</blockquote>
<p>So, in a key-value store that uses consistent hashing, keys are hashed using the same hash function, and <strong>each key is stored on the nearest node in the clockwise direction</strong> from the key’s position on the ring. Revisiting the same user IDs from our shopping cart use case, the calculation changes slightly, as shown below.</p>
<pre><code class="lang-plaintext">ring_size = 360
ring_position = hash(user_id) % ring_size
db_node_id = nearest node clockwise from ring_position

# user_id=42  → position 42  → N0 (next clockwise node, at 90)
# user_id=100 → position 100 → N1 (next clockwise node, at 220)
# user_id=107 → position 107 → N1 (next clockwise node, at 220)
</code></pre>
<p><em>Figure 7</em> helps to clarify this further.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748406532356/0197b03e-b68c-4574-87e8-037812f3ab02.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-adding-a-new-node-to-the-cluster-1">➕ Adding a New Node <strong>to the cluster</strong></h3>
<p>Let’s understand with a diagram.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748406680342/52b4e888-d017-4144-8950-8a1ed21cffff.png" alt class="image--center mx-auto" /></p>
<p>A few things happen when a new node is added, as can be seen in <em>Figure 8</em>.</p>
<ul>
<li><p>The keys 100 and 107, which were previously part of node N1, are now part of the new node N3, requiring a remapping of keys from Node N1 to N3.</p>
</li>
<li><p>The existing data also needs to be moved from N1 to N3.</p>
</li>
</ul>
<h3 id="heading-removing-a-node-from-the-cluster-1">➖ Removing a Node <strong>from the cluster</strong></h3>
<p>Let’s understand with a diagram.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748422438112/5716ead4-ffdb-4f9f-8b58-532d22c15e9d.png" alt class="image--center mx-auto" /></p>
<p>A few things happen when node N0 is removed, as can be seen in <em>Figure 9</em>.</p>
<ul>
<li><p>The key 42, which was part of node N0, is now part of node N3, requiring a remapping of keys from Node N0 to N3.</p>
</li>
<li><p>The existing data also needs to be moved from N0 to N3.</p>
</li>
</ul>
<p>In both scenarios, there is minimal data movement required to adjust the cluster.</p>
<p>This is a simple example, but it illustrates the concept on which consistent hashing is built. In our shopping cart use case for an internet-scale global e-commerce site, there can be hundreds of millions of users and thousands of nodes, and the number of shard IDs can be much higher than the number of physical nodes on the ring. This can lead to data skew, resulting in hotspots.</p>
<p>For example, in <em>Figure 9</em>, if there are many keys whose positions on the ring are between 91 and 165, then all those will eventually land on N3, potentially making it a hotspot. Additionally, if N3 goes down, then all the load will shift to N1, which may overload and fail N1, in which case, the existing load on N1 will shift to N2, again overloading N2 and potentially causing the node to fail. This is called <em>cascading failure</em>. In order to circumvent cascading failure and to uniformly distribute data across physical nodes, there is a concept called a <em>Virtual Node</em>.</p>
<h2 id="heading-virtual-nodes-enhancing-balance-and-stability">⚖️Virtual Nodes: Enhancing Balance and Stability</h2>
<p>A <em>Virtual Node</em>, as the name suggests, is logical: virtual nodes are added to the consistent hashing ring to make data distribution uniform and to avoid cascading failures. A virtual node is essentially an extra position on the ring that maps back to a physical node. Let’s see how.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748407933511/bd0559e6-3a46-45a5-a4ef-0d675daf75f9.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>There are three physical nodes in the cluster: N0, N1, and N2.</p>
</li>
<li><p>We create two virtual nodes for each physical node (N0-0, N0-1, etc.).</p>
</li>
<li><p>As a result, there will be more node positions on the hash ring.</p>
</li>
<li><p>This allows for uniformity in data and load distribution, as can be seen in <em>Figure 10</em>, thus reducing the chance of hotspots.</p>
</li>
<li><p>The system maintains a mapping between virtual nodes and physical nodes in the form of a Map <code>Map&lt;VirtualNode, PhysicalNode&gt;</code>.</p>
</li>
</ul>
<blockquote>
<p><strong>Key-on-node-edge</strong>: If a key’s hash exactly matches a vnode’s position (e.g., key at 90), it maps to that vnode. In our example, a key at 90 lands on the vnode at 90 rather than the next one at 104.</p>
</blockquote>
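<p>A minimal lookup sketch may help. The vnode names and ring positions below are invented for illustration (they do not match Figure 10 exactly), and the <code>Map&lt;VirtualNode, PhysicalNode&gt;</code> mapping from the text is encoded here in the naming convention <code>Nx-y</code>:</p>

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class VirtualNodeLookup {
    // Hypothetical positions on a 0-360 ring: two virtual nodes per physical node
    static final SortedMap<Integer, String> RING = new TreeMap<>(Map.of(
            30, "N0-0", 90, "N1-0", 150, "N2-0",
            210, "N0-1", 270, "N1-1", 330, "N2-1"));

    // Which physical node owns a vnode (the vnode-to-physical mapping from the text)
    static String physicalOf(String vnode) {
        return vnode.substring(0, vnode.indexOf('-'));
    }

    // First vnode at or clockwise after the key's position; wraps around past 330
    static String vnodeFor(int keyPos) {
        SortedMap<Integer, String> tail = RING.tailMap(keyPos);
        return tail.isEmpty() ? RING.get(RING.firstKey()) : RING.get(tail.firstKey());
    }

    public static void main(String[] args) {
        int keyPos = 100; // e.g. hash(user_id) % 360
        String vnode = vnodeFor(keyPos);
        System.out.println(keyPos + " -> " + vnode + " on " + physicalOf(vnode));
        // prints: 100 -> N2-0 on N2
    }
}
```

<p>Note how interleaving the vnodes of different physical nodes around the ring spreads each physical node’s load across several arcs instead of one.</p>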
<h2 id="heading-consistent-hashing-prototype">🛠️ <strong>Consistent Hashing Prototype</strong></h2>
<p>Explore a working Java prototype demonstrating key operations:</p>
<pre><code class="lang-java"><span class="hljs-comment">// --- SimpleConsistentHashRing.java ---</span>
<span class="hljs-keyword">import</span> java.util.*;

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SimpleConsistentHashRing</span> </span>{
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> SortedMap&lt;Integer, String&gt; ring = <span class="hljs-keyword">new</span> TreeMap&lt;&gt;();
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> <span class="hljs-keyword">int</span> N = <span class="hljs-number">360</span>;  <span class="hljs-comment">// ring size</span>

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">addNode</span><span class="hljs-params">(String nodeId)</span> </span>{
        <span class="hljs-keyword">int</span> hash = Math.floorMod(Objects.hash(nodeId), N); <span class="hljs-comment">// floorMod avoids negative positions</span>
        ring.put(hash, nodeId);
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">removeNode</span><span class="hljs-params">(String nodeId)</span> </span>{
        ring.values().removeIf(id -&gt; id.equals(nodeId));
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> String <span class="hljs-title">getNodeForKey</span><span class="hljs-params">(String key)</span> </span>{
        <span class="hljs-keyword">if</span> (ring.isEmpty()) <span class="hljs-keyword">return</span> <span class="hljs-keyword">null</span>;
        <span class="hljs-keyword">int</span> hash = Math.floorMod(Objects.hash(key), N); <span class="hljs-comment">// floorMod avoids negative positions</span>

        SortedMap&lt;Integer, String&gt; tail = ring.tailMap(hash);
        <span class="hljs-keyword">return</span> tail.isEmpty() 
             ? ring.get(ring.firstKey()) 
             : ring.get(tail.firstKey());
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">main</span><span class="hljs-params">(String[] args)</span> </span>{
        SimpleConsistentHashRing ring = <span class="hljs-keyword">new</span> SimpleConsistentHashRing();
        ring.addNode(<span class="hljs-string">"A"</span>); 
        ring.addNode(<span class="hljs-string">"B"</span>); 
        ring.addNode(<span class="hljs-string">"C"</span>);

        System.out.println(<span class="hljs-string">"user:1001 → "</span> + ring.getNodeForKey(<span class="hljs-string">"user:1001"</span>));
        ring.addNode(<span class="hljs-string">"D"</span>);
        System.out.println(<span class="hljs-string">"user:1001 → "</span> + ring.getNodeForKey(<span class="hljs-string">"user:1001"</span>));
    }
}
</code></pre>
<h2 id="heading-production-best-practices">✅ Production Best Practices</h2>
<ol>
<li><p>Use a large, fixed ring size (e.g., 64-bit). Changing the ring size calls for a complete cluster rebalancing and is not efficient.</p>
</li>
<li><p>Select fast, non-cryptographic hash functions (e.g., MurmurHash).</p>
</li>
<li><p>Allocate sufficient virtual nodes to avoid data skewness and hot spots.</p>
</li>
</ol>
<h2 id="heading-final-thoughts-amp-discussion">💬 Final Thoughts &amp; Discussion</h2>
<p>Consistent hashing excels for stateful distributed storage, offering elasticity, resilience, and minimal rebalancing overhead. Curious about replication or why systems like Kafka can still scale using modulo hashing? Drop your thoughts in the comments below!</p>
<h2 id="heading-other-articles-you-may-like">📚 Other Articles You May Like</h2>
<ul>
<li>Dive deeper into Write-Ahead Logs in <a target="_blank" href="https://bytefreak.hashnode.dev/understanding-write-ahead-logs-durability-beyond-the-flush">Understanding Write-Ahead Logs: Durability Beyond the Flush</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Understanding Write-Ahead Logs: Durability Beyond the Flush]]></title><description><![CDATA[Databases are a fundamental part of modern software architecture. Depending on the use case, we rely on different types — from relational databases like PostgreSQL to NoSQL systems like Cassandra, or even distributed log systems like Kafka.
But have ...]]></description><link>https://bytefreak.dev/understanding-write-ahead-logs-durability-beyond-the-flush</link><guid isPermaLink="true">https://bytefreak.dev/understanding-write-ahead-logs-durability-beyond-the-flush</guid><category><![CDATA[WriteAheadLog]]></category><category><![CDATA[DatabaseInternals]]></category><category><![CDATA[DataDurability]]></category><category><![CDATA[SystemsDesign]]></category><category><![CDATA[wal]]></category><dc:creator><![CDATA[Subhashish (Subh) Bhattacharjee]]></dc:creator><pubDate>Tue, 15 Apr 2025 04:09:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/y02jEX_B0O0/upload/e26fee74e4cbe3ed60e8a09a5b687f03.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Databases are a fundamental part of modern software architecture. Depending on the use case, we rely on different types — from relational databases like PostgreSQL to NoSQL systems like Cassandra, or even distributed log systems like Kafka.</p>
<p>But have you ever wondered what happens to your data when the database crashes? How does the system ensure that your committed data isn't lost?</p>
<p>This is where <strong>Write-Ahead Log (WAL)</strong> comes into play. In this blog, we’ll dive into the internals of WAL, explore how it works behind the scenes, and understand its critical role in ensuring data durability.</p>
<hr />
<h2 id="heading-what-is-wal">🔧 What is WAL?</h2>
<p>Every database uses an internal representation of data in memory — whether it’s based on <strong>B+ Trees</strong> or <strong>LSM Trees</strong>. When users issue commands to write or update records, these actions are first performed in memory and then periodically flushed to disk. This process is known as a <strong>checkpoint</strong>.</p>
<p>Since writes are batched before flushing, there’s always a risk of losing committed transactions if the system crashes before flushing to disk.</p>
<blockquote>
<p>💡 One might think: “Why not flush every transaction directly to disk?”<br />Because it’s <strong>inefficient</strong> — writing every transaction individually involves random disk seeks, index updates, and structural changes, which reduces throughput.</p>
</blockquote>
<p><strong>WAL</strong> solves this problem by introducing an immutable <strong>append-only log file</strong>. Each write is first recorded in the WAL, then applied to in-memory data structures.</p>
<p>📝 Think of WAL like a diary — jotting down everything <em>before</em> making the actual changes. If the system crashes mid-way, the diary can help restore what was lost.</p>
<hr />
<h2 id="heading-internals-how-wal-works">🧠 Internals: How WAL Works</h2>
<p>Writing to a sequential log (WAL) is significantly faster than writing to structured files.</p>
<h3 id="heading-typical-write-path-in-a-wal-enabled-database">Typical Write Path in a WAL-enabled Database</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744626215640/e6c8cf42-a3d0-4e5d-ab12-c211300614d3.png" alt class="image--center mx-auto" /></p>
<p><strong>Steps:</strong></p>
<ol>
<li><p>Write is appended to the WAL — durability guaranteed.</p>
</li>
<li><p>Change is applied to an in-memory structure (like a memtable or buffer pool).</p>
</li>
<li><p>Once memory crosses a threshold, data is flushed to disk (checkpoint).</p>
</li>
<li><p>Old WAL logs can be purged after checkpoint to reduce log size.</p>
</li>
</ol>
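<p>The four steps above can be sketched as an in-memory toy. The names and the tiny checkpoint threshold are invented for illustration; a real engine appends the WAL to disk and fsyncs it:</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WalWritePath {
    final List<String> wal = new ArrayList<>();       // stand-in for the on-disk log file
    final Map<String, String> memtable = new HashMap<>();
    static final int CHECKPOINT_THRESHOLD = 2;        // tiny on purpose, for illustration

    void put(String key, String value) {
        wal.add("PUT " + key + "=" + value);  // step 1: append to the WAL first
        memtable.put(key, value);             // step 2: apply to the in-memory structure
        if (memtable.size() >= CHECKPOINT_THRESHOLD) {
            checkpoint();                     // step 3: flush once memory crosses the threshold
        }
    }

    void checkpoint() {
        // A real engine would persist the memtable to its on-disk format here.
        memtable.clear();
        wal.clear();                          // step 4: old WAL entries can now be purged
    }

    public static void main(String[] args) {
        WalWritePath db = new WalWritePath();
        db.put("user:26", "age=20");
        db.put("user:27", "age=31");          // second write triggers the checkpoint
        System.out.println("WAL entries after checkpoint: " + db.wal.size()); // prints 0
    }
}
```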
<hr />
<h2 id="heading-advantages-of-wal">🔍 Advantages of WAL</h2>
<ul>
<li><p>✅ <strong>Crash Recovery</strong>: Replays committed transactions from the WAL.</p>
</li>
<li><p>✅ <strong>Durability</strong>: Guarantees no data loss post-commit.</p>
</li>
<li><p>✅ <strong>Performance</strong>: Append-only writes are fast and sequential.</p>
</li>
<li><p>✅ <strong>Lazy Flushing</strong>: Flushes to disk in the background.</p>
</li>
<li><p>✅ <strong>Garbage Collection</strong>: Older WAL entries can be discarded post-checkpoint.</p>
</li>
<li><p>✅ <strong>Replication</strong>: WAL can be shipped to replicas for faster sync.</p>
</li>
</ul>
<hr />
<h2 id="heading-conceptual-wal-entry-format">📦 Conceptual WAL Entry Format</h2>
<p>A WAL entry typically stores:</p>
<ul>
<li><p>LSN (Log Sequence Number, a byte offset for every record)</p>
</li>
<li><p>Transaction ID</p>
</li>
<li><p>Operation Type</p>
</li>
<li><p>Table + Row ID</p>
</li>
<li><p>Before/After values</p>
</li>
<li><p>Timestamp</p>
</li>
<li><p>CRC32 (for integrity)</p>
</li>
</ul>
<pre><code class="lang-plaintext">LSN: 00001234
TransactionID: 99768
Operation: UPDATE
Table: users
RowID: 26
Before: { age: 20 }
After:  { age: 21 }
Timestamp: 2025-04-10 15:12:10
CRC32: 0x5d41402a
</code></pre>
<p>The data above is for representational purposes only; in a real system, it is stored in <strong>binary format</strong>. Additionally, <strong>CRC32</strong> is used as a <strong>checksum</strong> to ensure data integrity and is usually calculated over the entire record.</p>
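<p>The integrity check is easy to demonstrate with Java’s built-in <code>java.util.zip.CRC32</code>. The flattened record below is a made-up illustration, not an actual WAL layout:</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class WalChecksum {
    // Computes the CRC32 of a serialized record (real systems checksum the binary form)
    static long crcOf(String record) {
        CRC32 crc = new CRC32();
        crc.update(record.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        // Hypothetical flattened WAL entry; the field layout is invented for illustration
        String entry = "LSN=00001234|TXN=99768|UPDATE|users|26|age:20->21";
        long stored = crcOf(entry);           // written to disk alongside the entry

        // On recovery, recompute and compare: a mismatch means a torn or corrupt record
        System.out.println("intact:    " + (crcOf(entry) == stored));                      // true
        System.out.println("corrupted: " + (crcOf(entry.replace("21", "99")) == stored));  // false
    }
}
```

<p>Because the corrupted bytes here fall within a 32-bit span, CRC32 is guaranteed to catch this particular change; arbitrary corruption is caught with probability of roughly 1 − 2⁻³².</p>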
<hr />
<h2 id="heading-durability-and-fsync">💾 Durability and fsync()</h2>
<p>It’s important to note that a DB operation is <strong>not truly durable</strong> just because it’s written to the WAL in memory or even buffered by the OS. There are multiple layers between the application and the actual disk:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744519096525/63bb7c52-75f0-4e2b-9a5f-3a9a898a877d.png" alt class="image--center mx-auto" /></p>
<p>To <strong>ensure durability</strong>, systems call <code>fsync()</code> (or similar system calls) to force the WAL to be flushed from the OS cache all the way to disk.</p>
<p>Every layer in the write path uses <strong>write buffering</strong> to improve performance, so calling <code>fsync()</code> tells the OS: <em>“Please flush this data now.”</em> However, frequent <code>fsync()</code> calls come at the cost of throughput. Many systems (like Kafka, PostgreSQL) <strong>batch writes and fsync periodically</strong> to strike a balance between durability and throughput.</p>
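<p>In Java, the append-then-fsync step can be sketched with <code>FileChannel</code>, whose <code>force()</code> call is the JVM’s counterpart to <code>fsync()</code>. The file name and record format are illustrative:</p>

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FsyncDemo {
    // Appends one entry to the given log file and fsyncs before returning
    static void appendDurably(Path walFile, String entry) throws IOException {
        try (FileChannel ch = FileChannel.open(walFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ch.write(ByteBuffer.wrap(entry.getBytes(StandardCharsets.UTF_8)));
            // write() only hands bytes to the OS page cache. force(true) is
            // Java's fsync: it blocks until data and metadata reach the device.
            ch.force(true);
        }
    }

    public static void main(String[] args) throws IOException {
        Path walFile = Files.createTempFile("wal", ".log"); // hypothetical WAL file
        appendDurably(walFile, "PUT user:26 age=21\n");
        System.out.println(Files.readString(walFile)); // prints: PUT user:26 age=21
    }
}
```

<p><code>force(false)</code> skips the metadata flush, similar to <code>fdatasync()</code>, trading a little safety for throughput.</p>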
<hr />
<h2 id="heading-wal-prototype">⚙️ WAL Prototype</h2>
<p>To solidify the concepts, I’ve built a simple WAL prototype in Java, showing:</p>
<ul>
<li><p>Append-only log writes</p>
</li>
<li><p>Basic recovery logic</p>
</li>
</ul>
<p>The code snippet below shows a simple WAL that logs operations and flushes changes to disk.</p>
<pre><code class="lang-java"><span class="hljs-keyword">import</span> java.io.BufferedWriter;
<span class="hljs-keyword">import</span> java.io.File;
<span class="hljs-keyword">import</span> java.io.FileWriter;
<span class="hljs-keyword">import</span> java.io.IOException;

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">WriteAheadLog</span> </span>{
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> File logFile;
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> BufferedWriter writer;

    <span class="hljs-comment">/**
     * Initializes the Write-Ahead Log with a given file name.
     * Creates or appends to the file if it already exists.
     *
     * <span class="hljs-doctag">@param</span> fileName The name of the log file to use.
     * <span class="hljs-doctag">@throws</span> IOException If the file cannot be created or opened.
     */</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">WriteAheadLog</span><span class="hljs-params">(String fileName)</span> <span class="hljs-keyword">throws</span> IOException </span>{
        <span class="hljs-keyword">this</span>.logFile = <span class="hljs-keyword">new</span> File(fileName);
        <span class="hljs-comment">// Open the file in append mode to preserve previous entries</span>
        <span class="hljs-keyword">this</span>.writer = <span class="hljs-keyword">new</span> BufferedWriter(<span class="hljs-keyword">new</span> FileWriter(logFile, <span class="hljs-keyword">true</span>));
    }

    <span class="hljs-comment">/**
     * Writes a single operation to the log file.
     * Each operation is flushed immediately to ensure durability.
     *
     * <span class="hljs-doctag">@param</span> operation The string representing the operation (e.g., PUT, GET).
     * <span class="hljs-doctag">@throws</span> IOException If writing to the file fails.
     */</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">log</span><span class="hljs-params">(String operation)</span> <span class="hljs-keyword">throws</span> IOException </span>{
        writer.write(operation);
        writer.newLine();    <span class="hljs-comment">// Add newline to separate log entries</span>
        writer.flush();      <span class="hljs-comment">// Flushes to the OS cache; full durability also needs fsync (see above)</span>
    }
}
</code></pre>
<p>👉 <strong>You can explore the full working prototype with recovery logic on</strong> <a target="_blank" href="https://github.com/sbcharr/wal-demo">GitHub</a></p>
<hr />
<h2 id="heading-final-thoughts">🔚 Final Thoughts</h2>
<p>The Write-Ahead Log is one of the most fundamental techniques used in reliable storage systems. From PostgreSQL to Kafka, WAL ensures durability without sacrificing write performance.</p>
<h2 id="heading-lets-discuss">💬 Let’s Discuss</h2>
<ul>
<li><p>Did you ever face a data loss incident?</p>
</li>
<li><p>Interested in WAL in distributed systems like Kafka?</p>
</li>
</ul>
<p>Let me know in the comments!</p>
]]></content:encoded></item></channel></rss>