Tue 21 April 2026

Probabilistic Data Structures

Keeping track of a fixed set of items is fairly straightforward; however, when the number of items grows beyond the capacity of a single machine, things get expensive. There's a way around this: rely on approximations instead of exact answers. There are two probabilistic data structures I'd like to cover in this post: Bloom filters and the Count-Min Sketch.

Generally these are applied when space is a constraint and you need a predictable, consistent memory footprint. If you're counting or caching at scale, there's a good chance your database is already relying on probability instead of certainty.

Hash Tables

A fundamental data structure in computer science is the hash table, useful in caching and counting, as well as representing objects in software.

A large hash table is often useful when you need a key-value store, such as one that maps a user ID to a profile picture, so that every request for a profile picture is speedy: hash tables offer O(1) look-ups.

This is done by having a number of buckets and a function that consistently converts a key to the index of a bucket. This is called the hash function. A cache or key-value store then only needs to compute the hash to locate the data.

If we were to hash the key "fox" (hash("fox")) and the resulting output was 5 we would know our data is in bucket 5.
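A minimal sketch of that mapping, assuming SHA-256 as the hash function (an illustrative choice; any hash with a stable output works):

```python
import hashlib

def bucket_for(key: str, num_buckets: int) -> int:
    # Hash the key, then reduce the digest to a bucket index.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

# The same key always lands in the same bucket.
index = bucket_for("fox", 10)
```

Taking the modulus of the hash is what keeps the index inside the fixed range of buckets.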

[Figure: Hash Tables]

Hash functions won't compute unique hashes, and occasionally they collide with existing keys stored in the hash table. So both hash("fox") and hash("cat") might end up pointing to bucket 5. We can reduce the chance of this happening by increasing the number of buckets: with 10 keys and 1 million buckets, the chance of a collision becomes extremely small.
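How small? A birthday-style calculation sketches it, assuming the hash spreads keys uniformly across buckets:

```python
def collision_probability(n: int, m: int) -> float:
    """Probability of at least one collision when n keys are
    hashed uniformly into m buckets."""
    p_no_collision = 1.0
    for i in range(n):
        # The (i+1)th key must avoid the i buckets already taken.
        p_no_collision *= (m - i) / m
    return 1 - p_no_collision

# 10 keys in 1,000,000 buckets: about a 0.0045% chance of any collision.
print(f"{collision_probability(10, 1_000_000):.6f}")  # → 0.000045
```
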

In practice hash tables store a linked list at each bucket location, and when a collision occurs they iterate through the list until they find the key. When storing a new key, it's appended to the end of the list if it isn't there already. Redis uses this technique, and additionally resizes the number of buckets dynamically in order to keep the length of these lists to a minimum.
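A toy version of that chaining scheme (Python lists stand in for linked lists here, purely for brevity):

```python
class ChainedHashTable:
    def __init__(self, num_buckets: int = 8):
        # One chain of (key, value) pairs per bucket.
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def set(self, key, value):
        chain = self._bucket(key)
        for i, (k, _) in enumerate(chain):
            if k == key:               # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))     # otherwise append to the chain

    def get(self, key):
        # On a collision we scan the chain until we find the key.
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None
```

With a single bucket every key collides, and the table still behaves correctly, just more slowly, which is why Redis bothers to resize.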

We can use hash tables to determine whether we've seen something before by storing keys as we see them: if a key already exists in the hash table, we know this isn't the first time we've seen the item.

Bloom Filters

When we start dealing with data streams or billions of users, storing everything in memory can be expensive. Instead we can reduce the total memory consumed by using a probabilistic data structure; the bloom filter.

Bloom filters rely on approximate set membership. So instead of a definite "yes, this item is in the set" or "no, it isn't", we get one of the following outcomes:

  1. This item is not in the set.
  2. This item might be in the set.

This is done by having a fixed number of buckets and using more than one hash function. As you can see in the illustration below, the key is hashed three times and each resulting bucket is set to 1.

[Figure: Bloom Filters]

When an item is queried, we know for sure that we haven't seen it before if any of its buckets contains a 0. However, if all of its buckets are set to 1, then we might have seen it before.

Both the number of buckets and the number of hashes can be configured, which allows us to trade more space for a reduction in the probability of false positives (saying we might have seen an item when we haven't).
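The whole scheme fits in a few lines. This sketch derives its k hash functions by salting a single SHA-256 hash, which is one illustrative way to do it, not the only one:

```python
import hashlib

class BloomFilter:
    def __init__(self, num_buckets: int = 1024, num_hashes: int = 3):
        self.bits = [0] * num_buckets
        self.num_hashes = num_hashes

    def _indexes(self, key: str):
        # Simulate k independent hash functions by salting one hash.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.bits)

    def add(self, key: str) -> None:
        for i in self._indexes(key):
            self.bits[i] = 1

    def might_contain(self, key: str) -> bool:
        # False: definitely never seen. True: might have been seen.
        return all(self.bits[i] for i in self._indexes(key))
```

Raising `num_buckets` or tuning `num_hashes` is exactly the space-for-accuracy trade mentioned above.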

The bloom filter is applied in situations where space is limited and keeping track of every element isn't an option. If we wish to avoid making expensive queries for data that doesn't exist, a bloom filter can cut down the number of those queries.

Browsers have used bloom filters in the past, shipping with a preset filter of malicious URLs. When we visit a URL and it's not in the filter we can proceed; if it might be in the filter, we query a server to help determine whether it's safe. Since most URLs are safe, we avoid this query for the majority of them.

Count-Min Sketch

The last probabilistic data structure I'd like to cover is the Count-Min Sketch. Like a bloom filter it has multiple hash functions; unlike the bloom filter, it tracks the number of times a key lands in each bucket.

When queried, it hashes the key and returns the minimum of the counts stored in the corresponding buckets. This gives us an upper-bound estimate of the number of times we've seen a key.
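A minimal sketch, reusing the salted-hash trick from the bloom filter section (again an illustrative choice): one row of counters per hash function, increment on add, take the minimum on query.

```python
import hashlib

class CountMinSketch:
    def __init__(self, width: int = 1024, depth: int = 3):
        # One row of `width` counters per hash function.
        self.rows = [[0] * width for _ in range(depth)]

    def _indexes(self, key: str):
        for salt, row in enumerate(self.rows):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % len(row)

    def add(self, key: str) -> None:
        for row, i in self._indexes(key):
            row[i] += 1

    def count(self, key: str) -> int:
        # Collisions only ever inflate a counter, so the minimum across
        # rows is an upper bound on the true count.
        return min(row[i] for row, i in self._indexes(key))
```
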

[Figure: Count-Min Sketch]

The Count-Min Sketch is useful in large-scale data processing. For example, if we're interested in tracking the top-k searches, we could normally do this with a heap; if we have size restrictions and need constant space instead of O(n) space, we can put the sketch in front of the heap.

Heap inserts take O(log n) time, which we can avoid if we know the item shouldn't be in the heap; items that appear infrequently are then discarded before ever reaching the heap. We do this by querying the sketch for an upper bound on the new item's count, say 3: if the kth item in our heap has 12 appearances, we can skip adding the new item.
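The gating logic can be sketched like this. The `estimate` argument stands in for a query to the sketch (the names `maybe_push` and the sample counts are made up for illustration):

```python
import heapq

def maybe_push(heap: list, k: int, estimate: int, item: str) -> None:
    if len(heap) < k:
        heapq.heappush(heap, (estimate, item))
    elif estimate > heap[0][0]:          # beats the kth item's count
        heapq.heapreplace(heap, (estimate, item))
    # else: infrequent item, skip the O(log n) heap insert entirely

heap, k = [], 2
for item, est in [("fox", 12), ("cat", 8), ("dog", 3)]:
    maybe_push(heap, k, est, item)
# "dog" (estimate 3) never enters the heap; it holds fox and cat
```

Because a min-heap keeps the smallest count at the root, comparing against `heap[0]` is a constant-time check before committing to the logarithmic insert.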

I have found it interesting that at scale we can use probability to optimise our systems, but doing so also requires an understanding of how the data is distributed. These data structures work well on long-tail distributions, yet when all items appear with similar frequency they become less useful. It would be interesting to discover how systems map to different distributions of data and how these structures are set up to solve the given problems.
