
Bloom Filter: Definitely No, Probably Yes


In large-scale distributed systems, knowing what you don’t have is often more valuable than knowing what you do.

Let’s understand with a practical example. Imagine you are building a recommendation engine for a blogging site like Hashnode or Medium, where users read blog posts. You want to show users fresh content they will love to check out, and you certainly don’t want to recommend them articles they have already read.

You could query the primary database to fetch the user’s history and filter out already-seen articles. However, for a large-scale blog site, hitting the DB for every recommendation would be a waste of I/O. You can store this information (userId → list of articleIds) in a Redis cache. But let’s look at the cost of keeping such a large dataset in-memory:

Let’s do a back-of-the-envelope calculation:

  • 100 Million Users

  • Suppose each user has read 100 articles (on average).

  • ArticleId is a 64-bit (8-byte) identifier.

100M users * 100 articles * 8 bytes = 80 GB (ignoring overhead)
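The arithmetic above can be verified with a quick script:

```python
# Back-of-the-envelope cost of storing every user's read history in memory.
users = 100_000_000        # 100M users
articles_per_user = 100    # average articles read per user
bytes_per_id = 8           # each articleId is a 64-bit (8-byte) identifier

total_bytes = users * articles_per_user * bytes_per_id
print(f"{total_bytes / 10**9:.0f} GB")  # → 80 GB, ignoring overhead
```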

Though it’s manageable at this stage, it will only grow with time. This is where the Bloom filter shines. It is a space-efficient probabilistic data structure that answers one question extremely fast: “Is this element in the set?”, and it does so without storing the actual element. Lookups are fast, and the filter typically occupies an order of magnitude less memory than storing all the IDs directly.

The outcome is either of the two:

  1. No (it is definitely not in the set).

  2. Maybe yes (it might be in the set, or it might be a false positive).

Why Use a Bloom Filter?

Beyond saving memory in recommendation engines, Bloom filters can solve a critical security issue in backend systems: cache penetration.

Let’s consider a standard Cache-Aside pattern:

  1. App requests data for Key X

  2. Cache Miss (Key X not found).

  3. The app queries the database.

  4. The app then populates the cache.

Now, imagine an attacker (or a bug in the code) spams your API with random, nonexistent UUIDs.

  • The cache misses (because the keys don't exist).

  • Every single request goes to the database.

  • The database doesn’t find it, so it returns “Not Found.”

If the volume is high enough, this flood of invalid requests can bring down the primary database.

You add a Bloom filter in front of the cache. If the filter returns “No,” you trust it immediately and return 404 Not Found. You never touch the cache, let alone the database.
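A minimal sketch of this guard, assuming plain dict-like cache and DB handles (all names here are illustrative, not a real API):

```python
# Cache-aside read guarded by a Bloom filter (illustrative sketch).
def get_article(article_id, bloom, cache, db):
    # 1. Bloom filter says "definitely not there" → reject with zero I/O.
    if article_id not in bloom:
        return None  # translate to 404 Not Found at the API layer

    # 2. "Maybe yes" → fall through to the normal cache-aside path.
    value = cache.get(article_id)
    if value is None:
        value = db.get(article_id)     # may still miss: a false positive
        if value is not None:
            cache[article_id] = value  # populate the cache on a DB hit
    return value
```

The key property: an attacker spamming nonexistent IDs is stopped at step 1, before the cache or database sees a single request.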

How It Works: The Bit Array

Under the hood, a Bloom Filter doesn’t store the actual data; it stores 1s and 0s in a bit array to mark the presence of the value under consideration (a userId or userName, etc.). It has two main parts:

  1. An array of m bits, all initialized to 0.

  2. h different hash functions.

Write Path:
A simple example is to add user information to the bloom filter so as to later know whether the user exists in the system or not. Let’s add the user cathy. We pass her name through our h=3 hash functions.

  • h1("cathy") % m = 1

  • h2("cathy") % m = 3

  • h3("cathy") % m = 6

We set those indices in our array to 1 (as shown in the figure below). That’s it.

Read Path:
Now, let’s check if cathy exists. We run her name through the same hash functions and check whether the values at indices 1, 3, and 6 are all 1. If they are, she probably exists.

Now, let's check bob who was never added.

  • h1("bob") % m points to 1 (the value at index 1 is 1 because of cathy).

  • h2("bob") % m points to 7 (the value at index 7 is 0).

  • h3("bob") % m points to 6 (the value at index 6 is 1 because of cathy).

Because the value at index 7 is 0, we know for a fact that bob was never added. We stop, and the result is “Definitely No.”
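The write and read paths above can be sketched as a tiny class. This is a toy implementation, not a production library; it salts a single SHA-256 hash to simulate h independent hash functions:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: an m-bit array and h hash functions."""

    def __init__(self, m=16, h=3):
        self.m = m
        self.h = h
        self.bits = [0] * m  # array of m bits, all initialized to 0

    def _indices(self, item):
        # Derive h pseudo-independent indices by salting one hash function.
        for i in range(self.h):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Write path: flip the h bits for this item to 1.
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # Read path: all h bits set → "probably yes";
        # any single 0 bit → "definitely no".
        return all(self.bits[idx] == 1 for idx in self._indices(item))
```

After `add("cathy")`, `might_contain("cathy")` is always True; a name whose hashes land on any 0 bit, like bob in the walkthrough above, returns False.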

The “False Positive” Case

So why did I say "Probably Yes" in the title?

Imagine we keep adding users until almost every bit in the array is flipped to 1. Eventually, we might check for a user named jack. By sheer coincidence, his hash values land on indices that were already set to 1 by other users.

  • The filter sees all 1s.

  • It tells you, “Maybe Jack exists.”

  • But in reality, he doesn’t.

This is a false positive. It’s the trade-off we need to make for speed and small memory usage. We can reduce these errors by making the array larger, using more hash functions, or using both, but we can never eliminate them 100%.
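The standard approximation makes this trade-off concrete: with m bits, h hash functions, and n inserted elements, the false-positive probability is roughly p ≈ (1 − e^(−hn/m))^h. A quick check (the parameter values here are just an illustration):

```python
import math

def false_positive_rate(m, h, n):
    """Approximate false-positive probability of a Bloom filter
    with m bits, h hash functions, and n inserted elements."""
    return (1 - math.exp(-h * n / m)) ** h

# Doubling the bits per element sharply cuts the error rate:
print(false_positive_rate(m=10_000, h=7, n=1_000))  # ~0.008 (10 bits/element)
print(false_positive_rate(m=20_000, h=7, n=1_000))  # ~0.0002 (20 bits/element)
```

Note that the rate shrinks but never reaches zero for any finite m.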

Standard Bloom filters don’t support deleting values (though deletion can be achieved with a Counting Bloom Filter). Once a bit is flipped to 1, it stays 1, because there is no way to know whether that bit belongs to cathy or bob. This makes the structure a perfect fit for read-heavy systems, or for workloads that are effectively append-only.

Real-World Use Cases

  1. As discussed, cache penetration protection: requests for non-existent keys are rejected up front instead of hammering the database on every cache miss.

  2. Used in web crawlers to check whether a page has already been crawled, so it isn’t crawled again.

  3. Used in systems such as Cassandra and HBase, which are built on LSM-tree-like structures. In Cassandra, data is first written to the memtable (an in-memory data structure) and periodically flushed to disk as SSTables. Every SSTable maintains its own Bloom filter. On a query, the DB first checks the memtable; on a miss, it consults each SSTable’s Bloom filter (newest → oldest). If a filter returns “not found,” the DB skips that SSTable’s index and data files entirely and moves on to the next one. This cuts down on disk seeks and drastically improves read performance.
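The Cassandra-style read path in item 3 can be sketched as follows. This is a heavily simplified illustration, with assumed dict/set stand-ins for the memtable, Bloom filters, and SSTable data files:

```python
# Illustrative sketch of an LSM-tree read path guarded by
# per-SSTable Bloom filters (structures are simplified assumptions).
def read(key, memtable, sstables):
    # 1. The newest data lives in the in-memory memtable.
    if key in memtable:
        return memtable[key]
    # 2. Check on-disk SSTables from newest to oldest.
    for sstable in sstables:  # assumed ordered newest → oldest
        if key not in sstable["bloom"]:
            continue  # "definitely not here" → skip index and data files
        value = sstable["data"].get(key)  # may miss: a false positive
        if value is not None:
            return value
    return None
```

Each `continue` is a disk seek avoided, which is exactly where the performance win comes from.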

Summary

Bloom filters are one of those “magic” tools that let you reject invalid requests instantly with a low memory footprint.

  • No → Definitely not in the set (Trust 100%).

  • Maybe → May be in the set (check the DB or downstream system to confirm).