<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[The ByteFreak Blog]]></title><description><![CDATA[Deep dives into distributed systems, backend engineering, system design, scalability patterns, and production hardening.]]></description><link>https://bytefreak.dev</link><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 22:29:49 GMT</lastBuildDate><atom:link href="https://bytefreak.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[High-Watermark in Distributed Systems: A Deep Dive with Apache Kafka]]></title><description><![CDATA[Let me start with a simple question. When a producer sends a message to Kafka, the leader broker writes it down to its local disk and sends the acknowledgment back. Is this considered done and availab]]></description><link>https://bytefreak.dev/high-watermark-in-apache-kafka</link><guid isPermaLink="true">https://bytefreak.dev/high-watermark-in-apache-kafka</guid><category><![CDATA[Apache Kafka]]></category><category><![CDATA[distributed systems]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[System Design]]></category><dc:creator><![CDATA[Subhashish (Subh) Bhattacharjee]]></dc:creator><pubDate>Wed, 04 Mar 2026 05:32:21 GMT</pubDate><content:encoded><![CDATA[<p>Let me start with a simple question. When a producer sends a message to Kafka, the leader broker writes it down to its local disk and sends the acknowledgment back. Is this considered done and available for consumption?</p>
<p>Not quite.</p>
<p>What if the leader crashes before the in-sync replicas (ISRs) have copied the message from the leader broker? If a consumer has too eagerly consumed the message, the system can no longer recover it. From the consumer's point of view, the message existed; from the system's point of view, after recovery, it never did.</p>
<p>Let's walk through the sequence of steps in a 3-broker (Broker-1, Broker-2, Broker-3) Kafka cluster with producer <code>acks=1, ISR=2, replication-factor=3</code>.</p>
<pre><code class="language-plaintext">1. Leader (Broker-1) writes message M, sends ack to producer.
2. A consumer too eagerly consumes message M.
3. Broker-1 crashes, M exists ONLY on Broker-1's disk.
4. Kafka elects a new leader from the ISR, say Broker-3.
5. Broker-3 doesn't have the message M. Cluster is now running without M.
6. Eventually, Broker-1 comes back online.
   But it comes back as a FOLLOWER, not a leader.
   It must sync itself to the new leader (Broker-3).
   Since Broker-3 doesn't have M, Broker-1 TRUNCATES its own log
   and deletes M to match.

→ M is now gone from every broker in the cluster. Permanently.
</code></pre>
<p>This is exactly the class of problem that the High-Watermark (HWM) solves in Apache Kafka. And once you understand the concept, you will see it, at least conceptually, in other systems: in Raft, in ZooKeeper, in PostgreSQL replication.</p>
<h3>Definitions First</h3>
<p>Three terms are essential to understanding the HWM in Kafka.</p>
<p><strong>Log End Offset (LEO)</strong> is simply the offset of the next message to be written. Think of it as the tip of the log. If a broker has messages at offsets 0, 1, 2, 3, 4, its LEO is 5.</p>
<p><strong>High-Watermark (HWM)</strong> marks the replication boundary: every message at an offset below the HWM has been fully replicated to every broker in the current in-sync replica set. Consumers can only read up to this point, never beyond it.</p>
<p><strong>In-Sync Replica (ISR)</strong> is the set of brokers (leader + followers) that are caught up to the leader. A follower falls out of the ISR if it stops fetching messages fast enough, controlled by the config <code>replica.lag.time.max.ms</code> (default: 30 seconds).</p>
<p>The most important relationship to remember: <code>HWM&lt;=LEO</code>, always. The leader's LEO moves ahead with every write. The HWM moves ahead only when all ISR members confirm they have replicated the latest data.</p>
<h3>A Concrete Walk-Through</h3>
<p>Let's use a simple setup through this post:</p>
<ul>
<li><p><strong>Replication factor=3</strong>: one leader (Broker-1) and two followers (Broker-2, Broker-3).</p>
</li>
<li><p><strong>ISR={Broker-1, Broker-2, Broker-3}</strong>, all three are in sync.</p>
</li>
<li><p><code>acks=1</code> The producer gets an acknowledgment as soon as the leader writes locally.</p>
</li>
</ul>
<ol>
<li><strong>The Starting State</strong></li>
</ol>
<p>Everything is caught up. All three brokers have messages A through E.</p>
<pre><code class="language-plaintext">Offset:          0    1    2    3    4
                [A]  [B]  [C]  [D]  [E]

Broker-1  (L):  [A]  [B]  [C]  [D]  [E]   LEO=5, HWM=5
Broker-2  (F):  [A]  [B]  [C]  [D]  [E]   LEO=5
Broker-3  (F):  [A]  [B]  [C]  [D]  [E]   LEO=5

Consumer can read: A, B, C, D, E (offsets 0 through 4)
</code></pre>
<p>HWM = 5 means "everything up to offset 4 is safe." Consumers are happy.</p>
<ol start="2">
<li><strong>A New Message Arrives: Producer Writes F</strong></li>
</ol>
<p>The producer sends message F. With <code>acks=1</code>, the leader writes it locally and immediately sends back an acknowledgment to the producer. Job done from the producer's perspective.</p>
<pre><code class="language-plaintext">Offset:          0    1    2    3    4    5
                [A]  [B]  [C]  [D]  [E]  [F]

Broker-1  (L):  [A]  [B]  [C]  [D]  [E]  [F]     LEO=6, HWM=5 ← still 5!
Broker-2  (F):  [A]  [B]  [C]  [D]  [E]          LEO=5
Broker-3  (F):  [A]  [B]  [C]  [D]  [E]          LEO=5

Producer: Got acknowledgment. Thinks F is written.
Consumer can read: A, B, C, D, E but F is NOT visible yet.
</code></pre>
<p>Notice something important here: the producer got its ack, but <strong>the HWM is still 5</strong>. Consumers still can't see F.</p>
<p>This is the key insight: <code>acks</code> and HWM are completely independent mechanisms. <code>acks</code> controls when the producer hears back. HWM controls when consumers are allowed to read. The leader knows it can't move the HWM forward yet because Broker-2 and Broker-3 haven't fetched F.</p>
<ol start="3">
<li><strong>Replication Happens — But Not Evenly</strong></li>
</ol>
<p>Followers in Kafka continuously send fetch requests to the leader, pulling new messages. Let's say Broker-2 is quick and fetches F right away. Broker-3 is slightly behind—maybe it had a brief GC pause or just a slower network.</p>
<pre><code class="language-plaintext">Offset:          0    1    2    3    4    5
                [A]  [B]  [C]  [D]  [E]  [F]

Broker-1  (L):  [A]  [B]  [C]  [D]  [E]  [F]   LEO=6, HWM=5
Broker-2  (F):  [A]  [B]  [C]  [D]  [E]  [F]   LEO=6
Broker-3  (F):  [A]  [B]  [C]  [D]  [E]        LEO=5
                                          ↑
                              HWM stays at 5.
Broker-3 is still in ISR and hasn't fetched F yet.

Consumer can read: A through E. F is still blocked.
</code></pre>
<p><strong>How does the leader know about the last message offset in Broker-3?</strong> When every follower sends a fetch request to the leader, it carries the offset the follower wants next. When Broker-3 sends <code>FetchRequest(offset=5)</code>, the leader knows it has everything up to offset 4. When Broker-2 sends <code>FetchRequest(offset=6)</code>, the leader knows Broker-2 has message F. The leader tracks the minimum across all ISR members, and that's the HWM.</p>
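<p>This bookkeeping is easy to model. Here is a minimal Java sketch (a simplified model of the leader-side logic, not Kafka's actual code): treat each ISR member's next-fetch offset as its LEO, and take the minimum.</p>

```java
import java.util.Map;

public class HighWatermark {
    // Simplified model: each ISR member's next-fetch offset equals its LEO.
    // The leader's HWM is the minimum LEO across the ISR (leader included).
    static long computeHwm(Map<String, Long> isrLeos) {
        return isrLeos.values().stream().mapToLong(Long::longValue).min().orElse(0L);
    }

    public static void main(String[] args) {
        // State from the walk-through: Broker-1 and Broker-2 have F (LEO=6),
        // Broker-3 has not fetched it yet (LEO=5).
        Map<String, Long> isr = Map.of("Broker-1", 6L, "Broker-2", 6L, "Broker-3", 5L);
        System.out.println("HWM = " + computeHwm(isr)); // prints: HWM = 5
    }
}
```

<p>Run against the state above (Broker-3 still at LEO=5), the minimum, and therefore the HWM, stays at 5 until Broker-3 catches up.</p>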
<ol start="4">
<li><strong>Broker-3 Catches up → HWM Finally Advances</strong></li>
</ol>
<p>Broker-3 fetches F. Now all three ISR members have it. The leader advances the HWM to 6.</p>
<pre><code class="language-plaintext">Offset:          0    1    2    3    4    5
                [A]  [B]  [C]  [D]  [E]  [F]

Broker-1  (L):  [A]  [B]  [C]  [D]  [E]  [F]   LEO=6, HWM=6
Broker-2  (F):  [A]  [B]  [C]  [D]  [E]  [F]   LEO=6
Broker-3  (F):  [A]  [B]  [C]  [D]  [E]  [F]   LEO=6

Consumer can now read: A, B, C, D, E, F
</code></pre>
<p>F is now a committed message. Every ISR member has it. Even if the leader crashes right now, any follower that becomes the new leader will have F, and no data is lost.</p>
<p><strong>So What Happens When the Leader Crashes Mid-Replication?</strong></p>
<p>Let's go back to the state where Broker-3 hadn't yet fetched F:</p>
<pre><code class="language-plaintext">Broker-1  (L):  [A]  [B]  [C]  [D]  [E]  [F]   ← crashes right here
Broker-2  (F):  [A]  [B]  [C]  [D]  [E]  [F]
Broker-3  (F):  [A]  [B]  [C]  [D]  [E]
                                         HWM = 5
</code></pre>
<p>Kafka now needs to elect a new leader from the ISR. Either Broker-2 or Broker-3 can win. Let's say Broker-3 wins the election.</p>
<pre><code class="language-plaintext">Broker-3 (New L): [A]  [B]  [C]  [D]  [E]   ← only has up to E. F is gone.
Broker-2  (F):    [A]  [B]  [C]  [D]  [E]  [F]  ← has F, but must truncate to match the new leader

New HWM = 5. F is permanently lost.
</code></pre>
<p>Broker-2 will truncate its own log to match the new leader, dropping F. The message is gone.</p>
<p><strong>But here's the important part</strong>: no consumer ever read F. It was above the HWM the entire time. From the consumer's perspective, nothing unusual happened. The system is consistent.</p>
<p>This is the trade-off you accept with <code>acks=1</code>. The producer got an ack, but the system made no durability promise. If you want F to survive any single broker failure, you need <code>acks=all</code>.</p>
<p>Now, let's say Broker-2 wins the election instead.</p>
<pre><code class="language-plaintext">Broker-2 (New L): [A]  [B]  [C]  [D]  [E]  [F] ← has F, becomes leader
Broker-3  (F):    [A]  [B]  [C]  [D]  [E]      ← catches up and fetches F

HWM advances to 6 once Broker-3 fetches F. F survives.
</code></pre>
<p>Whether F survives depends on which ISR member wins the election. This non-determinism is exactly why <code>acks=1</code> provides weaker durability. The producer receives an acknowledgment before replication, so a leader crash can still cause data loss.</p>
<h3>The <code>acks</code> Settings Compared</h3>
<p>Let's make this concrete with all three options:</p>
<p><code>acks=0</code>: Producer sends and moves on. No acknowledgment at all. Fastest, but any broker issue can lead to loss of data. Good for metrics or logs where occasional loss is acceptable.</p>
<p><code>acks=1</code>: Leader writes locally and acknowledges. Followers replicate in the background. If the leader crashes before replication, the message is lost. HWM still protects consumers from reading uncommitted data.</p>
<p><code>acks=all</code> (or <code>-1</code>) with <code>min.insync.replicas=2</code>: Leader waits until at least 2 ISR members (itself included) have written the message before acknowledging. F would only get an ack after the leader and at least one follower confirmed it. Slower, but your data can survive any single broker failure.</p>
<p>The combination you want for anything important is <code>acks=all + min.insync.replicas=2 + replication.factor=3 + unclean.leader.election.enable=false</code>. These together provide strong durability guarantees and are the common production configuration for surviving a single broker failure.</p>
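<p>In client code, only <code>acks</code> is a producer setting; the other three are topic- or broker-level configs. A minimal sketch of the producer side (the bootstrap address is a placeholder; the config key names are Kafka's standard ones):</p>

```java
import java.util.Properties;

public class DurableProducerConfig {
    // Producer-side settings for the durable configuration discussed above.
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092"); // placeholder address
        props.put("acks", "all"); // ack only after min.insync.replicas brokers have the write
        return props;
    }

    public static void main(String[] args) {
        // Topic/broker-side (set at topic creation or in server.properties, not here):
        //   replication.factor=3, min.insync.replicas=2,
        //   unclean.leader.election.enable=false
        System.out.println(producerProps().getProperty("acks")); // prints: all
    }
}
```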
<p>Remember, Kafka's durability guarantee is based on <strong>HWM</strong>, not producer ACK. This means a producer can receive an acknowledgment for a message that is still above the HWM (<code>LEO &gt; HWM</code>) and therefore not yet fully committed.</p>
<h3>Summary</h3>
<p>The High-Watermark is Kafka's promise to consumers: "<em>Everything you read has been replicated to every broker in the ISR — you will never read a message the cluster can't recover.</em>"</p>
<p>The LEO always moves ahead. The HWM follows carefully behind. And the gap between them — that narrow window of messages written but not yet fully replicated — is the territory Kafka keeps hidden from consumers until it's safe.</p>
<p>Understanding this precisely gives you the ability to reason about <code>acks</code> settings, ISR configuration, and replication lag in a way that goes beyond config documentation. Understanding this concept well also ensures that, in a production Kafka cluster, you know what to look for when the producer gets an acknowledgement from the leader, but the consumers don't see the message.</p>
<h3><strong>Other Articles You May Like</strong></h3>
<ul>
<li><p>Dive deeper into the write-ahead log in <a href="https://bytefreak.dev/understanding-write-ahead-logs-durability-beyond-the-flush">WAL</a></p>
</li>
<li><p>Understand the Bloom filter in <a href="https://bytefreak.dev/bloom-filter">Bloom Filter: Definitely No, Probably Yes</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[When Bloom Filters Fail: False Positives, Memory Trade-offs, Production Lessons]]></title><description><![CDATA[In my previous article, Bloom Filter: Definitely No, Probably Yes, we saw that a Bloom filter acts like a ‘magic’ toolbox to perform quick operations on large datasets to determine whether a value is certainly not in the set. However, this 'definitel...]]></description><link>https://bytefreak.dev/when-bloom-filters-fail</link><guid isPermaLink="true">https://bytefreak.dev/when-bloom-filters-fail</guid><category><![CDATA[distributed system]]></category><category><![CDATA[performance]]></category><category><![CDATA[System Design]]></category><category><![CDATA[Databases]]></category><category><![CDATA[observability]]></category><dc:creator><![CDATA[Subhashish (Subh) Bhattacharjee]]></dc:creator><pubDate>Wed, 11 Feb 2026 04:14:24 GMT</pubDate><content:encoded><![CDATA[<p>In my previous article, <a target="_blank" href="https://bytefreak.dev/bloom-filter">Bloom Filter: Definitely No, Probably Yes</a>, we saw that a Bloom filter acts like a ‘magic’ toolbox to perform quick operations on large datasets to determine whether a value is certainly not in the set. However, this 'definitely no, probably yes' nature, while enabling optimization, can become a silent killer if not designed with growth and observability in mind.</p>
<h3 id="heading-false-positives-dont-break-correctnessthey-shift-the-load">False positives don’t break correctness—they shift the load</h3>
<p>A Bloom filter never returns false negatives, which makes it feel safe. But false positives still matter.</p>
<p>Imagine that you have implemented a cache-penetration Bloom filter sized for 100M user IDs at a 1% false positive (FP) rate. The following year, the user base grows to 500M. The FP rate degrades dramatically, to the point where a large share of IDs now hit Redis and the DB anyway, negating the value of the Bloom filter.</p>
<p>So, false positives don’t cause failures; they just shift the load to the downstream systems, which effectively negates the benefit of the filter.</p>
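<p>You can quantify this degradation with the standard false-positive approximation <code>p ≈ (1 - e^(-kn/m))^k</code>. A small Java sketch using the textbook sizing formulas (the exact degraded rate depends on how the filter was originally sized; for a filter sized optimally for the original load, it is severe):</p>

```java
public class BloomFpRate {
    // Standard false-positive approximation: p ≈ (1 - e^(-k*n/m))^k
    static double fpRate(long m, int k, long n) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    // Textbook optimal sizing: m = -n*ln(p)/(ln 2)^2, k = (m/n)*ln 2
    static long optimalBits(long n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }
    static int optimalHashes(long n, long m) {
        return (int) Math.round((double) m / n * Math.log(2));
    }

    public static void main(String[] args) {
        long m = optimalBits(100_000_000L, 0.01); // sized for 100M IDs at 1%
        int k = optimalHashes(100_000_000L, m);
        System.out.printf("at 100M: p = %.4f%n", fpRate(m, k, 100_000_000L)); // ≈ 0.01
        System.out.printf("at 500M: p = %.4f%n", fpRate(m, k, 500_000_000L)); // far above 0.01
    }
}
```

<p>With the optimal <code>m</code> and <code>k</code> for 100M items at 1%, the same filter holding 500M items answers “maybe” for the large majority of absent keys.</p>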
<h3 id="heading-memory-sizing-decisions-age-poorly">Memory sizing decisions age poorly</h3>
<p>Bloom filters are typically sized for the data volume at deployment time. As the data grows past that, the false-positive rate climbs and the filter's efficacy drops.</p>
<p>Because there are no errors or alerts, the degradation goes unnoticed. A filter that seems sufficient today doesn't fail as data grows—it just becomes less effective, silently shifting the load downstream without raising alarms.</p>
<p>Hence, a Bloom filter's job doesn’t end with deployment to production; it requires constant monitoring and resizing as the data grows.</p>
<h3 id="heading-more-hash-functions-arent-always-a-fix">More hash functions aren’t always a fix</h3>
<p>When false positives increase, it’s tempting to add more hash functions. It seems an obvious fix, but we need to be aware of the impact on overall latency.</p>
<p>Each additional hash function increases CPU work and touches more memory. In a latency-sensitive path, this can increase the tail latency (p95/p99) even if false positives drop significantly. Research on LSM-trees shows that with fast storage (NVMs), Bloom filter hashing can dominate query latency and can become a significant bottleneck as key sizes grow.</p>
<p>Hence, increasing the hash functions can help up to a certain point, and performance needs to be measured based on the system’s latency requirements.</p>
<h3 id="heading-observability-matters-more-than-configuration">Observability matters more than configuration</h3>
<p>As you may have understood by now, a Bloom filter in a production use case is not effective without good observability. A few important metrics:</p>
<ul>
<li><p>How often does the filter return positives?</p>
<p>  <code>bloom_positives_total / bloom_checks_total</code></p>
</li>
<li><p>What fraction of those positives are false (wasted downstream calls)?</p>
<p>  <code>(bloom_positives_total - bloom_true_positives) / bloom_positives_total</code></p>
</li>
<li><p>Does it still reduce load over time?</p>
<p>  <code>(db_calls_without_filter - db_calls_with_filter) / db_calls_without_filter</code></p>
</li>
</ul>
<h3 id="heading-when-bloom-filters-may-be-the-wrong-choice">When Bloom Filters may be the wrong choice</h3>
<ul>
<li><p>False positives are expensive or unacceptable: each false positive triggers a full cache + DB check. If your downstream system can’t handle the extra load, don’t use a Bloom filter.</p>
</li>
<li><p>The check lies in a latency-critical path: hashing overhead (especially with many hash functions) adds microseconds that matter in p99-sensitive APIs.</p>
</li>
<li><p>The data set is small enough: if your dataset fits in a few MB of Redis or can be indexed efficiently in Postgres or a similar system, a Bloom filter is overkill.</p>
</li>
</ul>
<h2 id="heading-summary">Summary</h2>
<p>Bloom filters are powerful tools, but they are not set-and-forget optimizations.</p>
<p>In real systems, their value depends on sizing, memory trade-offs, and observability. Used well, they reduce load; used casually, they quietly move problems downstream.</p>
]]></content:encoded></item><item><title><![CDATA[Bloom Filter:  Definitely No, Probably Yes]]></title><description><![CDATA[In large-scale distributed systems, knowing what you don’t have is often more valuable than knowing what you do.
Let’s understand with a practical example. Imagine you are building a recommendation engine for a blogging site like Hashnode or Medium, ...]]></description><link>https://bytefreak.dev/bloom-filter</link><guid isPermaLink="true">https://bytefreak.dev/bloom-filter</guid><category><![CDATA[distributed system]]></category><category><![CDATA[Databases]]></category><category><![CDATA[caching]]></category><category><![CDATA[performance]]></category><dc:creator><![CDATA[Subhashish (Subh) Bhattacharjee]]></dc:creator><pubDate>Tue, 03 Feb 2026 15:32:01 GMT</pubDate><content:encoded><![CDATA[<p>In large-scale distributed systems, knowing what you <em>don’t</em> have is often more valuable than knowing what you do.</p>
<p>Let’s understand with a practical example. Imagine you are building a recommendation engine for a blogging site like Hashnode or Medium, where users read blog posts. You want to show users fresh content they will love, and you certainly don’t want to recommend <em>articles they have already read.</em></p>
<p>You could query the primary database to fetch the user’s history and filter out already-seen articles. However, for a large-scale blog site, hitting the DB for every recommendation would be a waste of I/O. You can store this information (userId → list of articleIds) in a Redis cache. But let’s look at the cost of keeping such a large dataset in memory:</p>
<p>Let’s do a back-of-the-envelope calculation:</p>
<ul>
<li><p><strong>100 Million Users</strong></p>
</li>
<li><p>Suppose each user has read <strong>100 articles (on average)</strong>.</p>
</li>
<li><p>Each articleId is a 64-bit (8-byte) identifier.</p>
</li>
</ul>
<p><code>100M users * 100 articles * 8 bytes = 80 GB</code> (ignoring overhead)</p>
<p>Though it’s manageable at this stage, it will only grow with time. This is where the <strong>Bloom Filter</strong> shines. It is a space-efficient <em>probabilistic</em> data structure that answers one question extremely fast: <em>“Is this element in the set?”</em> and it does so without storing the actual element. Lookups are fast, and the filter occupies roughly an order of magnitude less memory than storing all the IDs directly.</p>
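<p>How much less? The textbook sizing formula <code>m = -n * ln(p) / (ln 2)^2</code> works out to roughly 9.6 bits per element at a 1% false-positive rate, regardless of how large each ID is. A back-of-the-envelope sketch in Java:</p>

```java
public class BloomSizing {
    // Bits needed for n elements at false-positive rate p: m = -n*ln(p)/(ln 2)^2
    static double bitsNeeded(double n, double p) {
        return -n * Math.log(p) / (Math.log(2) * Math.log(2));
    }

    public static void main(String[] args) {
        double n = 100_000_000.0 * 100; // 100M users x 100 read articles = 10B entries
        double rawBytes = n * 8;                     // 8-byte IDs stored directly
        double bloomBytes = bitsNeeded(n, 0.01) / 8; // 1% false-positive target
        System.out.printf("raw: %.0f GB, bloom: %.0f GB%n",
                rawBytes / 1e9, bloomBytes / 1e9); // ≈ 80 GB vs ≈ 12 GB
    }
}
```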
<p>The outcome is either of the two:</p>
<ol>
<li><p><strong>No</strong> (it is <em>definitely</em> not in the set).</p>
</li>
<li><p><strong>Maybe yes</strong> (it might be in the set or be a false positive).</p>
</li>
</ol>
<h2 id="heading-why-use-bloom-filter">Why Use Bloom Filter?</h2>
<p>Beyond saving memory in recommendation engines, Bloom filters can solve a critical security issue in backend systems: <strong>cache penetration.</strong></p>
<p>Let’s consider a standard <strong>Cache-Aside</strong> pattern:</p>
<ol>
<li><p>App requests data for <code>Key X</code></p>
</li>
<li><p>Cache Miss (Key X not found).</p>
</li>
<li><p>The app queries the database.</p>
</li>
<li><p>The app then populates the cache.</p>
</li>
</ol>
<p>Now, imagine an attacker (or a bug in the code) spams your API with random, nonexistent UUIDs.</p>
<ul>
<li><p>The cache misses (because the keys don't exist).</p>
</li>
<li><p>Every single request goes to the database.</p>
</li>
<li><p>The database doesn’t find it, so it returns “Not Found.”</p>
</li>
</ul>
<p>If the volume is high enough, this flood of invalid requests can bring down the primary database.</p>
<p>You add a Bloom Filter in front of the cache. If the filter returns “No,” you trust it immediately and return <code>404 Not Found</code>. You don’t touch the cache, and you <em>certainly don’t touch the database</em>.</p>
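<p>The guarded lookup path can be sketched as below. The filter, cache, and DB layers are passed in as plain functions here; they are hypothetical stand-ins, and a real service would wire in its own clients (and would also populate the cache on a DB hit, omitted for brevity):</p>

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

public class PenetrationGuard {
    // Hypothetical wiring: mightContain is the Bloom filter check,
    // cacheLookup and dbLookup are the usual cache-aside layers.
    static Optional<String> lookup(String key,
                                   Predicate<String> mightContain,
                                   Function<String, Optional<String>> cacheLookup,
                                   Function<String, Optional<String>> dbLookup) {
        if (!mightContain.test(key)) {
            return Optional.empty(); // "definitely no" -> 404; no cache, no DB
        }
        return cacheLookup.apply(key)
                .or(() -> dbLookup.apply(key)); // "maybe" -> normal cache-aside path
    }

    public static void main(String[] args) {
        Optional<String> r = lookup("ghost-key",
                k -> false,                    // filter says definitely absent
                k -> { throw new AssertionError("cache should not be hit"); },
                k -> { throw new AssertionError("DB should not be hit"); });
        System.out.println(r.isEmpty()); // prints: true
    }
}
```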
<h2 id="heading-how-it-works-the-bit-array">How It Works: The Bit Array</h2>
<p>Under the hood, a Bloom Filter doesn’t store the actual data; it stores 1s and 0s in a bit array to mark the presence of the value under consideration (a userId or userName, etc.). It has two main parts:</p>
<ol>
<li><p>An array of <code>m</code> bits, all initialized to <code>0</code>.</p>
</li>
<li><p><code>h</code> different hash functions.</p>
</li>
</ol>
<p><strong>Write Path:</strong><br />A simple example: add users to the Bloom filter so we can later check whether a given user exists in the system. Let’s add the user <code>cathy</code>. We pass her name through our <code>h=3</code> hash functions.</p>
<ul>
<li><p><code>h1("cathy") % m = 1</code></p>
</li>
<li><p><code>h2("cathy") % m = 3</code></p>
</li>
<li><p><code>h3("cathy") % m = 6</code></p>
</li>
</ul>
<p>We set those indices in our array to <strong>1</strong> (as shown in the figure below). That’s it.</p>
<p><strong>Read Path:</strong><br />Now, let’s check if <code>cathy</code> exists. We run the same hashes and check whether the values at indices 1, 3, and 6 are all <code>1</code>. If they are, she <em>probably</em> exists.</p>
<p>Now, let's check <code>bob</code>, who was never added.</p>
<ul>
<li><p><code>h1("bob") % m</code> points to <strong>1</strong> (the value at index 1 is <code>1</code> because of <code>cathy</code>).</p>
</li>
<li><p><code>h2("bob") % m</code> points to <strong>7</strong> (the value at index 7 is <code>0</code>).</p>
</li>
<li><p><code>h3("bob") % m</code> points to <strong>6</strong> (the value at index 6 is <code>1</code> because of <code>cathy</code>).</p>
</li>
</ul>
<p>Because the value at index 7 is <code>0</code>, we know for a fact that <code>bob</code> has never been added. We stop, and the result is <strong>“Definitely No.”</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769155950453/63efbc6e-cb1b-4a29-be09-050726e3323a.png" alt class="image--center mx-auto" /></p>
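<p>The whole mechanism fits in a few lines of Java. This sketch uses <code>BitSet</code> and derives its <code>k</code> indices (called <code>h</code> above) by double hashing on <code>String.hashCode()</code>, a common trick for toy examples rather than what production libraries do:</p>

```java
import java.util.BitSet;

public class TinyBloom {
    private final BitSet bits;
    private final int m, k;

    TinyBloom(int m, int k) { this.bits = new BitSet(m); this.m = m; this.k = k; }

    // Double hashing: derive k indices from two base hashes.
    private int index(String s, int i) {
        int h1 = s.hashCode();
        int h2 = h1 * 31 + 17; // crude second hash, fine for a sketch
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(String s) { for (int i = 0; i < k; i++) bits.set(index(s, i)); }

    boolean mightContain(String s) {
        for (int i = 0; i < k; i++)
            if (!bits.get(index(s, i))) return false; // definitely no
        return true; // probably yes
    }

    public static void main(String[] args) {
        TinyBloom f = new TinyBloom(64, 3);
        f.add("cathy");
        System.out.println(f.mightContain("cathy")); // true
        System.out.println(f.mightContain("bob"));   // false for this input,
        // but with other inputs it could have been a false positive
    }
}
```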
<h2 id="heading-the-false-positive-case">The “False Positive” Case</h2>
<p>So why did I say "Probably Yes" in the title?</p>
<p>Imagine we keep adding users until almost every bit in the array is flipped to <code>1</code>. Eventually, we might check for a user named <code>jack</code>. By sheer coincidence, his hash values land on indices that are already set to <code>1</code> by <em>other</em> users.</p>
<ul>
<li><p>The filter sees all 1s.</p>
</li>
<li><p>It tells you, “Maybe Jack exists.”</p>
</li>
<li><p>But in reality, he doesn’t.</p>
</li>
</ul>
<p>This is a <strong>false positive.</strong> It’s the trade-off we make for speed and a small memory footprint. We can reduce these errors by making the array larger, using more hash functions (up to a point), or both, but we can never eliminate them entirely.</p>
<p>Standard Bloom filters don’t support deleting values (though deletion can be achieved with a <em>Counting Bloom Filter</em>). Once a bit is flipped to <code>1</code>, it stays <code>1</code>, because there is no way to know whether that bit belongs to <code>cathy</code> or <code>bob</code>. This makes Bloom filters a great fit for read-heavy systems, or data sets that are effectively append-only.</p>
<h2 id="heading-real-world-use-cases">Real-World Use Cases</h2>
<ol>
<li><p>As discussed, it can be used in cache penetration protection for requests with non-existent keys to not hammer the database every time there is a cache miss.</p>
</li>
<li><p>Used in web crawlers to check whether a page has already been crawled, avoiding redundant crawls.</p>
</li>
<li><p>Used in LSM-tree-based systems such as Cassandra and HBase. In Cassandra, data is first written to the memtable (an in-memory structure) and periodically flushed to disk as SSTables. Every SSTable maintains its own Bloom filter. When you query data, the DB checks the memtable first; if the key isn’t there, it consults each SSTable’s Bloom filter (newest → oldest). If a filter returns ‘definitely not,’ that SSTable’s index and data files are skipped entirely and the search moves on to the next SSTable. This avoids unnecessary disk reads and drastically improves read performance.</p>
</li>
</ol>
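<p>The Cassandra-style read path in the last point can be sketched as follows. The <code>SSTable</code> interface here is a hypothetical stand-in, not Cassandra's API; it just separates the cheap in-memory filter check from the expensive disk read:</p>

```java
import java.util.List;
import java.util.Optional;

public class SSTableLookup {
    // Hypothetical SSTable model: a Bloom filter guard plus an (expensive) disk read.
    interface SSTable {
        boolean mightContain(String key);           // Bloom filter check, in memory
        Optional<String> readFromDisk(String key);  // index + data file access
    }

    // Check memtable first (omitted here), then SSTables newest -> oldest,
    // skipping any whose Bloom filter says "definitely no".
    static Optional<String> lookup(String key, List<SSTable> newestFirst) {
        for (SSTable t : newestFirst) {
            if (!t.mightContain(key)) continue; // skip disk entirely
            Optional<String> v = t.readFromDisk(key);
            if (v.isPresent()) return v;        // empty here means false positive: keep going
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        SSTable miss = new SSTable() {
            public boolean mightContain(String k) { return false; }
            public Optional<String> readFromDisk(String k) {
                throw new AssertionError("disk read should have been skipped");
            }
        };
        SSTable hit = new SSTable() {
            public boolean mightContain(String k) { return true; }
            public Optional<String> readFromDisk(String k) { return Optional.of("v1"); }
        };
        System.out.println(lookup("row-42", List.of(miss, hit)).get()); // prints: v1
    }
}
```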
<h2 id="heading-summary">Summary</h2>
<p>Bloom filters are one of those “magic” tools that let you reject invalid requests instantly with a low memory footprint.</p>
<ul>
<li><p>No → Definitely not in the set (Trust 100%).</p>
</li>
<li><p>Maybe → May be in the set (check the DB or downstream system to confirm).</p>
</li>
</ul>
<h2 id="heading-other-articles-you-may-like"><strong>Other Articles You May Like</strong></h2>
<ul>
<li>Dive deeper into Consistent Hashing in <a target="_blank" href="https://hashnode.com/post/cmb7zwzgh001809jvhrte3x7b"><strong>From Modulo to Consistent Hashing</strong></a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Concurrency vs. Parallelism: A Coffee Shop Guide for Developers]]></title><description><![CDATA[If you ask ten developers to explain the difference between concurrency and parallelism, you might get ten slightly different answers. It’s one of those fundamental concepts that is easy to grasp abstractly but tricky to visualize in practice.
To und...]]></description><link>https://bytefreak.dev/concurrency-vs-parallelism-for-developers</link><guid isPermaLink="true">https://bytefreak.dev/concurrency-vs-parallelism-for-developers</guid><category><![CDATA[concurrency]]></category><category><![CDATA[Java]]></category><dc:creator><![CDATA[Subhashish (Subh) Bhattacharjee]]></dc:creator><pubDate>Thu, 22 Jan 2026 17:06:16 GMT</pubDate><content:encoded><![CDATA[<p>If you ask ten developers to explain the difference between concurrency and parallelism, you might get ten slightly different answers. It’s one of those fundamental concepts that is easy to grasp abstractly but tricky to visualize in practice.</p>
<p>To understand where we are today, we have to look at where we started.</p>
<h2 id="heading-the-single-core-era-vs-the-multi-core-revolution">The Single-Core Era vs. The Multi-Core Revolution</h2>
<p>Back in the “old days” of computing, we relied on single-core processors. Despite this limitation, computers still seemed to multitask. You could listen to music while typing a document, and it felt simultaneous. But it was an illusion.</p>
<p>The processor was frantically switching between tasks—giving a few milliseconds to the music player, then a few milliseconds to the word processor—so quickly that we humans couldn’t notice the gap. This is the foundation of <strong>threading</strong>.</p>
<p>Today, we have multi-core processors (dual-core, quad-core, octa-core, etc.) that can physically execute multiple instructions at the exact same instant. However, to utilize that power, we first need to design our software correctly.</p>
<h2 id="heading-defining-the-terms">Defining the Terms</h2>
<ul>
<li><p><strong>Concurrency</strong> is about <strong>structure</strong>. It is the composition of a program into small, independent tasks that <em>can</em> be executed out of order or in partial order.</p>
</li>
<li><p><strong>Parallelism</strong> is about <strong>execution</strong>. It is the simultaneous execution of distinct tasks.</p>
</li>
</ul>
<p>You can have concurrency without parallelism (the single-core example), but you generally cannot have parallelism without concurrency.</p>
<h2 id="heading-the-context-switch">The “Context Switch”</h2>
<p>In a concurrent system, the CPU has to save the state of the current thread (variables, instruction pointers) and load the state of the next thread. This is called a <strong>Context Switch</strong>.</p>
<p>Context switching is necessary for responsiveness, but it isn’t free. If you have a single processor and you spin up 1,000 threads, your computer might spend more time switching between them than actually doing the work!</p>
<h2 id="heading-the-coffee-shop-analogy">The Coffee Shop Analogy</h2>
<p>Let’s visualize this with a simple office breakroom scenario.</p>
<p><strong>Scenario A: Concurrent but NOT Parallel</strong><br />Imagine an office breakroom with <strong>two lines</strong> of developers but only <strong>one coffee machine</strong>.</p>
<ul>
<li><p>The developers are independent “tasks”.</p>
</li>
<li><p>The queues represent the structure (Concurrency).</p>
</li>
<li><p>The coffee machine is the CPU.</p>
</li>
</ul>
<p>Even though there are two lines, the coffee machine can only brew one cup at a time. It might serve the first person in Line A, then switch to the first person in Line B. This is concurrency. The tasks are progressing, and no single line is completely blocked, but they are sharing the same resource.</p>
<p><strong>Scenario B: Concurrent AND Parallel</strong><br />Now, management buys a <strong>second coffee machine</strong>.</p>
<ul>
<li><p>We still have the two lines (Concurrency).</p>
</li>
<li><p>But now, the person in Line A and the person in Line B can press “Brew” at the same instant.</p>
</li>
</ul>
<p>This is parallelism. Because we structured the problem correctly (separate queues), adding more hardware (the second coffee machine) instantly doubled our throughput. Let’s understand with a diagram.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769079403213/b8b870a8-23b6-4b71-9e73-6158642308ba.png" alt class="image--center mx-auto" /></p>
<p>An example code in Java should clarify further.</p>
<pre><code class="lang-java"><span class="hljs-keyword">import</span> java.util.concurrent.ExecutorService;
<span class="hljs-keyword">import</span> java.util.concurrent.Executors;

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">CoffeeShop</span> </span>{
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">main</span><span class="hljs-params">(String[] args)</span> </span>{
        <span class="hljs-comment">// Scenario 2: Parallelism (2 Coffee Machines -&gt; 2 Threads)</span>
        ExecutorService coffeeMachines = Executors.newFixedThreadPool(<span class="hljs-number">2</span>);

        Runnable makeCoffee = () -&gt; {
            String threadName = Thread.currentThread().getName();
            System.out.println(threadName + <span class="hljs-string">" is brewing coffee..."</span>);
            <span class="hljs-keyword">try</span> {
                Thread.sleep(<span class="hljs-number">2000</span>); <span class="hljs-comment">// Simulate brewing</span>
            } <span class="hljs-keyword">catch</span> (InterruptedException e) {
                Thread.currentThread().interrupt(); <span class="hljs-comment">// Restore the interrupt flag</span>
            }
            System.out.println(threadName + <span class="hljs-string">" is finished!"</span>);
        };

        <span class="hljs-comment">// Two people order at the same time</span>
        coffeeMachines.submit(makeCoffee);
        coffeeMachines.submit(makeCoffee);

        coffeeMachines.shutdown();
    }
}
</code></pre>
<p>If you run this, both “brewing” messages appear instantly. If you change the thread pool to <code>1</code> (concurrency), the second message would only appear after the first one finishes.</p>
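<p>To see the difference as wall-clock time rather than message ordering, here is a small timing sketch. The pool sizes and sleep durations are arbitrary illustration values, not anything prescribed by the executor API:</p>

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BrewTiming {
    // Runs two 500 ms "brews" on a pool of the given size and returns elapsed millis
    static long runWith(int machines) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(machines);
        Runnable brew = () -> {
            try {
                Thread.sleep(500); // simulate brewing one cup
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        long start = System.nanoTime();
        pool.submit(brew);
        pool.submit(brew);
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("1 machine:  ~" + runWith(1) + " ms"); // sequential: about 1000 ms
        System.out.println("2 machines: ~" + runWith(2) + " ms"); // parallel: about 500 ms
    }
}
```

<p>With one thread the two brews run back to back; with two threads they overlap, roughly halving the total time.</p>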
<h2 id="heading-summary">Summary</h2>
<p>Designing for concurrency means structuring your program so that tasks don’t rely on each other unnecessarily. If you write a program where every step must happen sequentially (Line A must finish before Line B starts), you can never parallelize it, no matter how many cores your system has.</p>
<p><strong>Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once.</strong></p>
<h2 id="heading-other-articles-you-may-like">Other Articles You May Like</h2>
<ul>
<li>Dive deeper into Consistent Hashing in <a target="_blank" href="https://hashnode.com/post/cmb7zwzgh001809jvhrte3x7b">From Modulo to Consistent Hashing</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[From Modulo to Consistent Hashing: Optimizing Distributed Storage]]></title><description><![CDATA[🔥 Ever tried scaling a single database past its limits? You’ll quickly encounter massive rebalance storms and downtime. While a single-server setup might handle initial workloads easily, expanding to tens or hundreds of millions of users demands dis...]]></description><link>https://bytefreak.dev/from-modulo-to-consistent-hashing-optimizing-distributed-storage</link><guid isPermaLink="true">https://bytefreak.dev/from-modulo-to-consistent-hashing-optimizing-distributed-storage</guid><category><![CDATA[distributed systems]]></category><category><![CDATA[consistent hashing]]></category><category><![CDATA[System Design]]></category><category><![CDATA[scalability]]></category><category><![CDATA[Databases]]></category><dc:creator><![CDATA[Subhashish (Subh) Bhattacharjee]]></dc:creator><pubDate>Wed, 28 May 2025 13:43:24 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/pFJtmoDMSAo/upload/43c0c0873dc1c2ae2407a3f26a2e5105.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>🔥 <strong>Ever tried scaling a single database past its limits</strong>? You’ll quickly encounter massive rebalance storms and downtime. While a single-server setup might handle initial workloads easily, expanding to tens or hundreds of millions of users demands distributed storage, bringing unique challenges to data management.</p>
<h2 id="heading-design-goals-for-distributed-systems">⚙️ Design Goals for Distributed Systems</h2>
<ul>
<li><p><strong>Uniform distribution:</strong> Avoid hotspots by evenly spreading data across nodes.</p>
</li>
<li><p><strong>High throughput:</strong> Scale horizontally for fast reads and writes.</p>
</li>
<li><p><strong>Elasticity:</strong> Add or remove nodes without disrupting service.</p>
</li>
<li><p><strong>Resilience:</strong> Handle node failures, network partitions, and unpredictable workloads.</p>
</li>
</ul>
<p>To achieve this, we need a mechanism that quickly determines data placement and minimizes shard rebalancing during cluster changes. Let's explore the need for consistent hashing with an illustrative example.</p>
<h2 id="heading-use-case-shopping-cart-service-in-a-global-e-commerce-site">🛒 Use Case: Shopping Cart Service in a Global E-Commerce Site</h2>
<p>Imagine you are building the shopping cart service for a global e-commerce site. Initially, with just a few thousand users, everything is simple and fits into a single node, as illustrated in <em>Figure 1</em>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748355556754/9d0860bd-0a42-474f-995a-865e5027366f.png" alt class="image--center mx-auto" /></p>
<p>Your business is a success, and the user base is now growing rapidly along with the data. This rapid growth soon necessitates distributing data across multiple nodes—a process known as <strong>sharding</strong>, as shown in <em>Figure 2</em>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748404886307/9670b295-c6f3-48a8-9347-b81e1b0e396a.png" alt class="image--center mx-auto" /></p>
<p>You store the information in a fast key-value store to ensure that actions like “Add to cart” and “Checkout” are quick and responsive. Here’s how a record may look in a key-value store:</p>
<pre><code class="lang-plaintext">Key   = "&lt;user_id&gt;"
Value = 
{
  "userId": "827391",
  "items": [
    {"productId": "SKU12345", "quantity": 2, "unitPrice": 25.99}
  ],
  "lastUpdated": "2025-05-27T14:32:15Z",
  "currency": "USD"
}
</code></pre>
<p>To scale, it’s necessary to shard the data, specifically <code>user_id</code> in our case. But how does the system decide the mapping between the shard ID and the server node? In other words, how do I know where to put my data?</p>
<h3 id="heading-modulo-hashing">🔢 Modulo Hashing</h3>
<p>Hashing deterministically converts variable‑length inputs into fixed‑size numeric values. A strong hash is <strong>fast</strong>, <strong>uniformly distributes</strong> outputs to avoid hotspots, and has a <strong>low collision rate</strong>—though it’s <strong>one‑way</strong>, so you can’t reverse it to recover the original data. Good hashing sets the stage for efficient sharding.</p>
<p>Let’s start our use case with a 3-node cluster (N0, N1, N2). Modulo hashing efficiently determines data placement:</p>
<pre><code class="lang-plaintext">nodes = ["N0", "N1", "N2"]
N = len(nodes)  # 3
db_node_id = nodes[hash(user_id) % N]

# Example assignments:
# user_id=42  → hash(42) % 3 = 0 → nodes[0] → "N0"
# user_id=100 → hash(100) % 3 = 1 → nodes[1] → "N1"
# user_id=107 → hash(107) % 3 = 2 → nodes[2] → "N2"
</code></pre>
<p>It’s a super-simple lookup. The distribution of shopping cart data among the three nodes in the database cluster is depicted in <em>Figure 3</em>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748357373724/cd06b7a4-3656-46c1-addf-07cce71f7eae.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p><strong>Key Definitions</strong></p>
<ul>
<li><p><strong>Cluster</strong>: A group of server nodes working together to store and serve your data.</p>
</li>
<li><p><strong>Shard Manager</strong>: The component (often built into the database or cache) that maps each shard ID to a specific node in the cluster.</p>
</li>
</ul>
</blockquote>
<p>While modulo hashing is simple, it suffers significantly when scaling up or down.</p>
<h3 id="heading-adding-a-new-node-to-the-cluster">➕ Adding a New Node <strong>to the cluster</strong></h3>
<p>During the peak sale season, a new node is added to take up some load. When adding a 4th node, recalculations shift almost every user’s data, causing extensive rebalancing:</p>
<pre><code class="lang-plaintext">nodes = ["N0", "N1", "N2", "N3"]
N = len(nodes)  # 4
db_node_id = nodes[hash(user_id) % N]

# Example assignments:
# user_id=42  → hash(42) % 4 = 2 → nodes[2] → "N2" (shifted)
# user_id=100 → hash(100) % 4 = 0 → nodes[0] → "N0" (shifted)
# user_id=107 → hash(107) % 4 = 3 → nodes[3] → "N3" (shifted)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748357413381/b828a8be-be3c-4845-9a5c-9aa6dd152e3d.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-removing-a-node-from-the-cluster">➖ Removing a Node <strong>from the cluster</strong></h3>
<p>Say N0 is removed from the cluster. The new shard mapping is then calculated as follows:</p>
<pre><code class="lang-plaintext">nodes = ["N1", "N2", "N3"]
N = len(nodes)  # 3
db_node_id = nodes[hash(user_id) % N]

# Example assignments:
# user_id=42  → hash(42) % 3 = 0 → nodes[0] → "N1" (shifted)
# user_id=100 → hash(100) % 3 = 1 → nodes[1] → "N2" (shifted)
# user_id=107 → hash(107) % 3 = 2 → nodes[2] → "N3" (unchanged)
</code></pre>
<p>Similarly, removing a node causes widespread data movement.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748406172105/af8f8495-d4d3-4ad4-aa04-033d77364cc5.png" alt class="image--center mx-auto" /></p>
<p>Because the assignment of a user ID to a node depends on the total number of active nodes in the cluster, adding or removing a node forces a large share of the data to move. In large systems, this frequent reshuffling is inefficient, leading to downtime and performance hits.</p>
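<p>The scale of this reshuffling is easy to quantify. The sketch below is a toy simulation (10,000 sequential integer IDs standing in for already-hashed user IDs) that counts how many keys change nodes when the cluster grows from three to four:</p>

```java
public class ModuloReshuffle {
    public static void main(String[] args) {
        int total = 10_000; // hypothetical user IDs 0..9999 (already-hashed stand-ins)
        int moved = 0;
        for (int id = 0; id < total; id++) {
            int before = id % 3; // node index with 3 nodes
            int after  = id % 4; // node index after adding a 4th node
            if (before != after) moved++;
        }
        System.out.println(moved + " of " + total + " keys moved"); // prints: 7498 of 10000 keys moved
    }
}
```

<p>Roughly three-quarters of the keys move. In general, growing from N to N+1 nodes with modulo hashing leaves only about 1/(N+1) of the keys in place.</p>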
<h2 id="heading-consistent-hashing-the-scalable-cure">🌐 Consistent Hashing: The Scalable Cure</h2>
<p>Consistent hashing is a highly efficient hashing mechanism that is used in many large-scale distributed systems, such as Cassandra and DynamoDB.</p>
<p>Consistent hashing represents the whole key space as a logical ring. The ring size is fixed and independent of the cluster’s size; for ease of understanding, let’s use a ring that runs from 0 to 360 in our example. Each physical node is hashed using a hash function and placed at the corresponding position on the hash ring, as shown in <em>Figure 6</em>.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748411239268/512e23b7-b9f2-4dd5-901f-ad8cb0ae30c8.png" alt class="image--center mx-auto" /></p>
<blockquote>
<p>Just as vehicles choose the nearest exit on a roundabout, consistent hashing picks the ‘closest’ node on the ring to store each key.</p>
</blockquote>
<p>So, in a key-value store that uses consistent hashing, keys are hashed using the same hash function, and <strong>each key is stored on the nearest node in the clockwise direction</strong> from the key’s position on the ring. Revisiting the same user IDs from our shopping cart use case, the calculation changes slightly, as shown below.</p>
<pre><code class="lang-plaintext">ring_size = 360
ring_position = hash(user_id) % ring_size
db_node_id = nearest node clockwise from ring_position

# user_id=42  → position 42  → N0 (next clockwise node, at 90)
# user_id=100 → position 100 → N1 (next clockwise node, at 220)
# user_id=107 → position 107 → N1 (next clockwise node, at 220)
</code></pre>
<p><em>Figure 7</em> helps to clarify this further.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748406532356/0197b03e-b68c-4574-87e8-037812f3ab02.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-adding-a-new-node-to-the-cluster-1">➕ Adding a New Node <strong>to the cluster</strong></h3>
<p>Let’s understand with a diagram.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748406680342/52b4e888-d017-4144-8950-8a1ed21cffff.png" alt class="image--center mx-auto" /></p>
<p>A few things happen when a new node is added, as can be seen in <em>Figure 8</em>.</p>
<ul>
<li><p>The keys 100 and 107, which were previously part of node N1, are now part of the new node N3, requiring a remapping of keys from Node N1 to N3.</p>
</li>
<li><p>The existing data also needs to be moved from N1 to N3.</p>
</li>
</ul>
<h3 id="heading-removing-a-node-from-the-cluster-1">➖ Removing a Node <strong>from the cluster</strong></h3>
<p>Let’s understand with a diagram.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748422438112/5716ead4-ffdb-4f9f-8b58-532d22c15e9d.png" alt class="image--center mx-auto" /></p>
<p>A few things happen when node N0 is removed, as can be seen in <em>Figure 9</em>.</p>
<ul>
<li><p>The key 42, which was part of node N0, is now part of node N3, requiring a remapping of keys from Node N0 to N3.</p>
</li>
<li><p>The existing data also needs to be moved from N0 to N3.</p>
</li>
</ul>
<p>In both scenarios, there is minimal data movement required to adjust the cluster.</p>
<p>This is a simple example, but it illustrates the concept on which consistent hashing is built. In our shopping cart use case for an internet-scale global e-commerce site, there can be hundreds of millions of users and thousands of nodes, and the number of shard IDs can be much higher than the number of physical nodes on the ring. This can lead to data skew, resulting in hotspots.</p>
<p>For example, in <em>Figure 9</em>, if there are many keys whose positions on the ring are between 91 and 165, then all those will eventually land on N3, potentially making it a hotspot. Additionally, if N3 goes down, then all the load will shift to N1, which may overload and fail N1, in which case, the existing load on N1 will shift to N2, again overloading N2 and potentially causing the node to fail. This is called <em>cascading failure</em>. In order to circumvent cascading failure and to uniformly distribute data across physical nodes, there is a concept called a <em>Virtual Node</em>.</p>
<h2 id="heading-virtual-nodes-enhancing-balance-and-stability">⚖️Virtual Nodes: Enhancing Balance and Stability</h2>
<p>A <em>Virtual Node</em>, as the name suggests, is logical: virtual nodes are added to the consistent hashing ring to make data distribution uniform and to avoid cascading failures. A virtual node is essentially an extra position on the ring that maps back to a physical node. Let’s see how.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748407933511/bd0559e6-3a46-45a5-a4ef-0d675daf75f9.png" alt class="image--center mx-auto" /></p>
<ul>
<li><p>There are three physical nodes in the cluster: N0, N1, and N2.</p>
</li>
<li><p>We create two virtual nodes for each physical node (N0-0, N0-1, etc.).</p>
</li>
<li><p>As a result, there will be more node positions on the hash ring.</p>
</li>
<li><p>This allows for uniformity in data and load distribution, as can be seen in <em>Figure 10</em>, thus reducing the chance of hotspots.</p>
</li>
<li><p>The system maintains a mapping between virtual nodes and physical nodes in the form of a Map <code>Map&lt;VirtualNode, PhysicalNode&gt;</code>.</p>
</li>
</ul>
<blockquote>
<p><strong>Key-on-node-edge</strong>: If a key’s hash exactly matches a vnode’s position (e.g., key at 90), it maps to that vnode. In our example, a key at 90 lands on the vnode at 90 rather than the next one at 104.</p>
</blockquote>
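<p>A minimal lookup sketch may help. The vnode names and ring positions below are invented for illustration (they do not match Figure 10 exactly), and the <code>Map&lt;VirtualNode, PhysicalNode&gt;</code> mapping from the text is encoded here in the naming convention <code>Nx-y</code>:</p>

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class VirtualNodeLookup {
    // Hypothetical positions on a 0-360 ring: two virtual nodes per physical node
    static final SortedMap<Integer, String> RING = new TreeMap<>(Map.of(
            30, "N0-0", 90, "N1-0", 150, "N2-0",
            210, "N0-1", 270, "N1-1", 330, "N2-1"));

    // Which physical node owns a vnode (the vnode-to-physical mapping from the text)
    static String physicalOf(String vnode) {
        return vnode.substring(0, vnode.indexOf('-'));
    }

    // First vnode at or clockwise after the key's position; wraps around past 330
    static String vnodeFor(int keyPos) {
        SortedMap<Integer, String> tail = RING.tailMap(keyPos);
        return tail.isEmpty() ? RING.get(RING.firstKey()) : RING.get(tail.firstKey());
    }

    public static void main(String[] args) {
        int keyPos = 100; // e.g. hash(user_id) % 360
        String vnode = vnodeFor(keyPos);
        System.out.println(keyPos + " -> " + vnode + " on " + physicalOf(vnode));
        // prints: 100 -> N2-0 on N2
    }
}
```

<p>Note how interleaving the vnodes of different physical nodes around the ring spreads each physical node’s load across several arcs instead of one.</p>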
<h2 id="heading-consistent-hashing-prototype">🛠️ <strong>Consistent Hashing Prototype</strong></h2>
<p>Explore a working Java prototype demonstrating key operations:</p>
<pre><code class="lang-java"><span class="hljs-comment">// --- SimpleConsistentHashRing.java ---</span>
<span class="hljs-keyword">import</span> java.util.*;

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">SimpleConsistentHashRing</span> </span>{
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> SortedMap&lt;Integer, String&gt; ring = <span class="hljs-keyword">new</span> TreeMap&lt;&gt;();
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> <span class="hljs-keyword">int</span> N = <span class="hljs-number">360</span>;  <span class="hljs-comment">// ring size</span>

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">addNode</span><span class="hljs-params">(String nodeId)</span> </span>{
        <span class="hljs-keyword">int</span> hash = Math.floorMod(Objects.hash(nodeId), N); <span class="hljs-comment">// floorMod avoids negative positions</span>
        ring.put(hash, nodeId);
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">removeNode</span><span class="hljs-params">(String nodeId)</span> </span>{
        ring.values().removeIf(id -&gt; id.equals(nodeId));
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> String <span class="hljs-title">getNodeForKey</span><span class="hljs-params">(String key)</span> </span>{
        <span class="hljs-keyword">if</span> (ring.isEmpty()) <span class="hljs-keyword">return</span> <span class="hljs-keyword">null</span>;
        <span class="hljs-keyword">int</span> hash = Math.floorMod(Objects.hash(key), N); <span class="hljs-comment">// floorMod avoids negative positions</span>

        SortedMap&lt;Integer, String&gt; tail = ring.tailMap(hash);
        <span class="hljs-keyword">return</span> tail.isEmpty() 
             ? ring.get(ring.firstKey()) 
             : ring.get(tail.firstKey());
    }

    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">void</span> <span class="hljs-title">main</span><span class="hljs-params">(String[] args)</span> </span>{
        SimpleConsistentHashRing ring = <span class="hljs-keyword">new</span> SimpleConsistentHashRing();
        ring.addNode(<span class="hljs-string">"A"</span>); 
        ring.addNode(<span class="hljs-string">"B"</span>); 
        ring.addNode(<span class="hljs-string">"C"</span>);

        System.out.println(<span class="hljs-string">"user:1001 → "</span> + ring.getNodeForKey(<span class="hljs-string">"user:1001"</span>));
        ring.addNode(<span class="hljs-string">"D"</span>);
        System.out.println(<span class="hljs-string">"user:1001 → "</span> + ring.getNodeForKey(<span class="hljs-string">"user:1001"</span>));
    }
}
</code></pre>
<h2 id="heading-production-best-practices">✅ Production Best Practices</h2>
<ol>
<li><p>Use a large, fixed ring size (e.g., 64-bit). Changing the ring size calls for a complete cluster rebalancing and is not efficient.</p>
</li>
<li><p>Select fast, non-cryptographic hash functions (e.g., MurmurHash).</p>
</li>
<li><p>Allocate sufficient virtual nodes to avoid data skewness and hot spots.</p>
</li>
</ol>
<h2 id="heading-final-thoughts-amp-discussion">💬 Final Thoughts &amp; Discussion</h2>
<p>Consistent hashing excels for stateful distributed storage, offering elasticity, resilience, and minimal rebalancing overhead. Curious about replication or why systems like Kafka can still scale using modulo hashing? Drop your thoughts in the comments below!</p>
<h2 id="heading-other-articles-you-may-like">📚 Other Articles You May Like</h2>
<ul>
<li>Dive deeper into Write-Ahead Logs in <a target="_blank" href="https://bytefreak.hashnode.dev/understanding-write-ahead-logs-durability-beyond-the-flush">Understanding Write-Ahead Logs: Durability Beyond the Flush</a></li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Understanding Write-Ahead Logs: Durability Beyond the Flush]]></title><description><![CDATA[Databases are a fundamental part of modern software architecture. Depending on the use case, we rely on different types — from relational databases like PostgreSQL to NoSQL systems like Cassandra, or even distributed log systems like Kafka.
But have ...]]></description><link>https://bytefreak.dev/understanding-write-ahead-logs-durability-beyond-the-flush</link><guid isPermaLink="true">https://bytefreak.dev/understanding-write-ahead-logs-durability-beyond-the-flush</guid><category><![CDATA[WriteAheadLog]]></category><category><![CDATA[DatabaseInternals]]></category><category><![CDATA[DataDurability]]></category><category><![CDATA[SystemsDesign]]></category><category><![CDATA[wal]]></category><dc:creator><![CDATA[Subhashish (Subh) Bhattacharjee]]></dc:creator><pubDate>Tue, 15 Apr 2025 04:09:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/y02jEX_B0O0/upload/e26fee74e4cbe3ed60e8a09a5b687f03.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Databases are a fundamental part of modern software architecture. Depending on the use case, we rely on different types — from relational databases like PostgreSQL to NoSQL systems like Cassandra, or even distributed log systems like Kafka.</p>
<p>But have you ever wondered what happens to your data when the database crashes? How does the system ensure that your committed data isn't lost?</p>
<p>This is where <strong>Write-Ahead Log (WAL)</strong> comes into play. In this blog, we’ll dive into the internals of WAL, explore how it works behind the scenes, and understand its critical role in ensuring data durability.</p>
<hr />
<h2 id="heading-what-is-wal">🔧 What is WAL?</h2>
<p>Every database uses an internal representation of data in memory — whether it’s based on <strong>B+ Trees</strong> or <strong>LSM Trees</strong>. When users issue commands to write or update records, these actions are first performed in memory and then periodically flushed to disk. This process is known as a <strong>checkpoint</strong>.</p>
<p>Since writes are batched before flushing, there’s always a risk of losing committed transactions if the system crashes before flushing to disk.</p>
<blockquote>
<p>💡 One might think: “Why not flush every transaction directly to disk?”<br />Because it’s <strong>inefficient</strong> — writing every transaction individually involves random disk seeks, index updates, and structural changes, which reduces throughput.</p>
</blockquote>
<p><strong>WAL</strong> solves this problem by introducing an immutable <strong>append-only log file</strong>. Each write is first recorded in the WAL, then applied to in-memory data structures.</p>
<p>📝 Think of WAL like a diary — jotting down everything <em>before</em> making the actual changes. If the system crashes mid-way, the diary can help restore what was lost.</p>
<hr />
<h2 id="heading-internals-how-wal-works">🧠 Internals: How WAL Works</h2>
<p>Writing to a sequential log (WAL) is significantly faster than writing to structured files.</p>
<h3 id="heading-typical-write-path-in-a-wal-enabled-database">Typical Write Path in a WAL-enabled Database</h3>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744626215640/e6c8cf42-a3d0-4e5d-ab12-c211300614d3.png" alt class="image--center mx-auto" /></p>
<p><strong>Steps:</strong></p>
<ol>
<li><p>Write is appended to the WAL — durability guaranteed.</p>
</li>
<li><p>Change is applied to an in-memory structure (like a memtable or buffer pool).</p>
</li>
<li><p>Once memory crosses a threshold, data is flushed to disk (checkpoint).</p>
</li>
<li><p>Old WAL logs can be purged after checkpoint to reduce log size.</p>
</li>
</ol>
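<p>The four steps above can be sketched as an in-memory toy. The names and the tiny checkpoint threshold are invented for illustration; a real engine appends the WAL to disk and fsyncs it:</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WalWritePath {
    final List<String> wal = new ArrayList<>();       // stand-in for the on-disk log file
    final Map<String, String> memtable = new HashMap<>();
    static final int CHECKPOINT_THRESHOLD = 2;        // tiny on purpose, for illustration

    void put(String key, String value) {
        wal.add("PUT " + key + "=" + value);  // step 1: append to the WAL first
        memtable.put(key, value);             // step 2: apply to the in-memory structure
        if (memtable.size() >= CHECKPOINT_THRESHOLD) {
            checkpoint();                     // step 3: flush once memory crosses the threshold
        }
    }

    void checkpoint() {
        // A real engine would persist the memtable to its on-disk format here.
        memtable.clear();
        wal.clear();                          // step 4: old WAL entries can now be purged
    }

    public static void main(String[] args) {
        WalWritePath db = new WalWritePath();
        db.put("user:26", "age=20");
        db.put("user:27", "age=31");          // second write triggers the checkpoint
        System.out.println("WAL entries after checkpoint: " + db.wal.size()); // prints 0
    }
}
```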
<hr />
<h2 id="heading-advantages-of-wal">🔍 Advantages of WAL</h2>
<ul>
<li><p>✅ <strong>Crash Recovery</strong>: Replays committed transactions from the WAL.</p>
</li>
<li><p>✅ <strong>Durability</strong>: Guarantees no data loss post-commit.</p>
</li>
<li><p>✅ <strong>Performance</strong>: Append-only writes are fast and sequential.</p>
</li>
<li><p>✅ <strong>Lazy Flushing</strong>: Flushes to disk in the background.</p>
</li>
<li><p>✅ <strong>Garbage Collection</strong>: Older WAL entries can be discarded post-checkpoint.</p>
</li>
<li><p>✅ <strong>Replication</strong>: WAL can be shipped to replicas for faster sync.</p>
</li>
</ul>
<hr />
<h2 id="heading-conceptual-wal-entry-format">📦 Conceptual WAL Entry Format</h2>
<p>A WAL entry typically stores:</p>
<ul>
<li><p>LSN (Log Sequence Number, a byte offset for every record)</p>
</li>
<li><p>Transaction ID</p>
</li>
<li><p>Operation Type</p>
</li>
<li><p>Table + Row ID</p>
</li>
<li><p>Before/After values</p>
</li>
<li><p>Timestamp</p>
</li>
<li><p>CRC32 (for integrity)</p>
</li>
</ul>
<pre><code class="lang-plaintext">LSN: 00001234
TransactionID: 99768
Operation: UPDATE
Table: users
RowID: 26
Before: { age: 20 }
After:  { age: 21 }
Timestamp: 2025-04-10 15:12:10
CRC32: 0x5d41402a
</code></pre>
<p>The data above is for representational purposes only; in a real system, it is stored in <strong>binary format</strong>. Additionally, <strong>CRC32</strong> is used as a <strong>checksum</strong> to ensure data integrity and is usually calculated over the entire record.</p>
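<p>The integrity check is easy to demonstrate with Java’s built-in <code>java.util.zip.CRC32</code>. The flattened record below is a made-up illustration, not an actual WAL layout:</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class WalChecksum {
    // Computes the CRC32 of a serialized record (real systems checksum the binary form)
    static long crcOf(String record) {
        CRC32 crc = new CRC32();
        crc.update(record.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        // Hypothetical flattened WAL entry; the field layout is invented for illustration
        String entry = "LSN=00001234|TXN=99768|UPDATE|users|26|age:20->21";
        long stored = crcOf(entry);           // written to disk alongside the entry

        // On recovery, recompute and compare: a mismatch means a torn or corrupt record
        System.out.println("intact:    " + (crcOf(entry) == stored));                      // true
        System.out.println("corrupted: " + (crcOf(entry.replace("21", "99")) == stored));  // false
    }
}
```

<p>Because the corrupted bytes here fall within a 32-bit span, CRC32 is guaranteed to catch this particular change; arbitrary corruption is caught with probability of roughly 1 − 2⁻³².</p>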
<hr />
<h2 id="heading-durability-and-fsync">💾 Durability and fsync()</h2>
<p>It’s important to note that a DB operation is <strong>not truly durable</strong> just because it’s written to the WAL in memory or even buffered by the OS. There are multiple layers between the application and the actual disk:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1744519096525/63bb7c52-75f0-4e2b-9a5f-3a9a898a877d.png" alt class="image--center mx-auto" /></p>
<p>To <strong>ensure durability</strong>, systems call <code>fsync()</code> (or similar system calls) to force the WAL to be flushed from the OS cache all the way to disk.</p>
<p>Every layer in the write path uses <strong>write buffering</strong> to improve performance, so calling <code>fsync()</code> tells the OS: <em>“Please flush this data now.”</em> However, frequent <code>fsync()</code> calls come at the cost of throughput. Many systems (like Kafka, PostgreSQL) <strong>batch writes and fsync periodically</strong> to strike a balance between durability and throughput.</p>
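<p>In Java, the append-then-fsync step can be sketched with <code>FileChannel</code>, whose <code>force()</code> call is the JVM’s counterpart to <code>fsync()</code>. The file name and record format are illustrative:</p>

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FsyncDemo {
    // Appends one entry to the given log file and fsyncs before returning
    static void appendDurably(Path walFile, String entry) throws IOException {
        try (FileChannel ch = FileChannel.open(walFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ch.write(ByteBuffer.wrap(entry.getBytes(StandardCharsets.UTF_8)));
            // write() only hands bytes to the OS page cache. force(true) is
            // Java's fsync: it blocks until data and metadata reach the device.
            ch.force(true);
        }
    }

    public static void main(String[] args) throws IOException {
        Path walFile = Files.createTempFile("wal", ".log"); // hypothetical WAL file
        appendDurably(walFile, "PUT user:26 age=21\n");
        System.out.println(Files.readString(walFile)); // prints: PUT user:26 age=21
    }
}
```

<p><code>force(false)</code> skips the metadata flush, similar to <code>fdatasync()</code>, trading a little safety for throughput.</p>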
<hr />
<h2 id="heading-wal-prototype">⚙️ WAL Prototype</h2>
<p>To solidify the concepts, I’ve built a simple WAL prototype in Java, showing:</p>
<ul>
<li><p>Append-only log writes</p>
</li>
<li><p>Basic recovery logic</p>
</li>
</ul>
<p>The code snippet below shows a simple WAL that logs operations and flushes changes to disk.</p>
<pre><code class="lang-java"><span class="hljs-keyword">import</span> java.io.BufferedWriter;
<span class="hljs-keyword">import</span> java.io.File;
<span class="hljs-keyword">import</span> java.io.FileWriter;
<span class="hljs-keyword">import</span> java.io.IOException;

<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">WriteAheadLog</span> </span>{
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> File logFile;
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> BufferedWriter writer;

    <span class="hljs-comment">/**
     * Initializes the Write-Ahead Log with a given file name.
     * Creates or appends to the file if it already exists.
     *
     * <span class="hljs-doctag">@param</span> fileName The name of the log file to use.
     * <span class="hljs-doctag">@throws</span> IOException If the file cannot be created or opened.
     */</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">WriteAheadLog</span><span class="hljs-params">(String fileName)</span> <span class="hljs-keyword">throws</span> IOException </span>{
        <span class="hljs-keyword">this</span>.logFile = <span class="hljs-keyword">new</span> File(fileName);
        <span class="hljs-comment">// Open the file in append mode to preserve previous entries</span>
        <span class="hljs-keyword">this</span>.writer = <span class="hljs-keyword">new</span> BufferedWriter(<span class="hljs-keyword">new</span> FileWriter(logFile, <span class="hljs-keyword">true</span>));
    }

    <span class="hljs-comment">/**
     * Writes a single operation to the log file.
     * Each operation is flushed immediately to ensure durability.
     *
     * <span class="hljs-doctag">@param</span> operation The string representing the operation (e.g., PUT, GET).
     * <span class="hljs-doctag">@throws</span> IOException If writing to the file fails.
     */</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">log</span><span class="hljs-params">(String operation)</span> <span class="hljs-keyword">throws</span> IOException </span>{
        writer.write(operation);
        writer.newLine();    <span class="hljs-comment">// Add newline to separate log entries</span>
        writer.flush();      <span class="hljs-comment">// Flushes to the OS cache; full durability also needs fsync (see above)</span>
    }
}
</code></pre>
<p>👉 <strong>You can explore the full working prototype with recovery logic on</strong> <a target="_blank" href="https://github.com/sbcharr/wal-demo">GitHub</a></p>
<hr />
<h2 id="heading-final-thoughts">🔚 Final Thoughts</h2>
<p>The Write-Ahead Log is one of the most fundamental techniques used in reliable storage systems. From PostgreSQL to Kafka, WAL ensures durability without sacrificing write performance.</p>
<h2 id="heading-lets-discuss">💬 Let’s Discuss</h2>
<ul>
<li><p>Did you ever face a data loss incident?</p>
</li>
<li><p>Interested in WAL in distributed systems like Kafka?</p>
</li>
</ul>
<p>Let me know in the comments!</p>
]]></content:encoded></item></channel></rss>