Bounded Wrongness: A Field Guide to Probabilistic Data Structures

Your database keeps a perfect index of every key it holds. Ask whether some key is present and it answers with total certainty: yes, here's the row, or no, it isn't here.

Now look at what it actually gets asked. In a write-heavy store, a huge fraction of lookups are for keys that aren't there, and the expensive part of answering "no" is the disk read you do to prove it. That read is exactly what Cassandra, RocksDB, Redis, and (for a while) Chrome all decided to skip. Each keeps a second, much smaller structure whose entire job is to be wrong, on purpose, at a rate they choose. It never sends you to disk for a key that's present. It occasionally sends you to disk for a key that isn't. They consider that an excellent deal.

TL;DR

This is a tour of the probabilistic data structures hiding inside systems you already run. The Bloom filter (1970) and its descendants (Counting, Cuckoo, Quotient, XOR, Binary Fuse) answer set membership; HyperLogLog and Count-Min answer "how many distinct" and "how often." All of them make the same trade: a bounded, one-sided, tunable error in exchange for a space collapse the exact structure can't touch. The skill isn't the formulas. It's knowing which question you can afford to ask slightly wrong, and by how much.

The negative oracle

Here's the reframe that makes the rest of this make sense: a Bloom filter is not an index that finds things. It's a gate that tells you, cheaply and with certainty, when something is definitely absent.

That word "certainty" is doing real work, and it only points one direction. A Bloom filter has two possible answers, and they are not symmetric. "Definitely not present" is the truth, always, no exceptions. "Maybe present" means go check, and sometimes the check comes up empty. There are false positives. There are never false negatives. If the filter says a key isn't there, it isn't there, full stop.

Sit with why that asymmetry is the whole game. When Cassandra looks for a partition, it consults a per-SSTable Bloom filter first. If the filter says "definitely not in this file," Cassandra skips the file entirely, no disk seek. If it says "maybe," Cassandra reads the file to find out. A false positive costs one wasted read. A false negative would cost correctness: it would tell Cassandra a row doesn't exist when it does, and you'd silently lose data. The structure is built so the cheap error can happen and the expensive one cannot.

The filter says	What it means	What you do	Cost when wrong
"Definitely not present"	Always true	Skip the lookup	Never wrong
"Maybe present"	Probably true	Do the real lookup	One wasted lookup

So the filter isn't storing your data. It's storing a compressed, lossy shadow of which keys you've seen, and it's right about absence every single time. I want a name for this trade, because it runs through every structure in this post: bounded wrongness. The error is real, but it's one-sided, it's capped, and you set the cap. It's a price you pay on purpose, not a risk that happens to you.

The 1970 idea, and the math that runs it

Burton Bloom published the whole thing in 1970, in a two-and-a-half-page CACM paper about hyphenation. He had a dictionary of words that needed special hyphenation rules, it was too big to keep in core memory, and most words don't need a special rule. He wanted a fast way to say "this word is definitely ordinary, don't bother looking it up." A negative oracle, in other words, for a problem nobody would call glamorous.

The mechanism fits in three sentences. Take a bit array of $m$ bits, all zero, and $k$ independent hash functions. To insert an item, hash it $k$ ways and set those $k$ bits to 1. To query, hash it $k$ ways and check those bits: if any is 0 the item is definitely absent, and if all are 1 it's probably present. Here it is in twelve lines of Python, double-hashing to fake $k$ hash functions from one:

python

import hashlibclass BloomFilter:    def __init__(self, m: int, k: int):        self.m, self.k = m, k        self.bits = bytearray(m // 8 + 1)    def _indices(self, item: str):        h = hashlib.sha256(item.encode()).digest()        h1, h2 = int.from_bytes(h[:8]), int.from_bytes(h[8:16])        return [(h1 + i * h2) % self.m for i in range(self.k)]    def add(self, item: str):        for i in self._indices(item):            self.bits[i // 8] |= 1 << (i % 8)    def __contains__(self, item: str) -> bool:        return all(self.bits[i // 8] >> (i % 8) & 1 for i in self._indices(item))

That all(...) is the negative oracle: one clear bit and the answer is a hard no. The interesting question is how often "all set" lies. With $n$ items inserted, the false-positive rate is

$FPR \approx (1 - e^{- k n / m})^{k} .$

Two knobs fall out of that. For a fixed amount of memory per item, there's an optimal number of hash functions, and it's a clean expression:

$k^{*} = \frac{m}{n} ln 2.$

More hashes is not more better. Past $k^{*}$ you're setting so many bits that the array saturates and the false-positive rate climbs back up. Plug $k^{*}$ back in and the best achievable rate depends only on bits-per-element $b = m / n$ :

$FPR_{opt} (b) = 2^{- b l n 2} \approx 0.618 5^{b} .$

This is the curve worth feeling rather than reading. Drag the number of hash functions and watch the rate dip toward the best-possible envelope and then climb back out the other side.

Run the numbers the other way, solving for the memory a target rate $p$ demands, and you get the figure worth memorizing:

$\frac{m}{n} = - \frac{l n p}{( l n 2 ) ^{2}} \approx - 1.44 lo g_{2} p .$

A 1% false-positive rate costs about 9.6 bits per element, with $k \approx 7$ . Each additional order of magnitude (down to 0.1%, then 0.01%) costs only about 4.8 more bits per element. And that $1.44$ factor isn't free real estate: the information-theoretic floor for an approximate membership filter is $lo g_{2} (1/ p)$ bits per element, so a Bloom filter is permanently 44% above optimal, by construction. Hold onto that 44%. Half the structures later in this post exist to claw it back.

The part that surprises people: none of this depends on how big your items are. Ten-character keys or ten-kilobyte ones, it's the same 9.6 bits each for 1%, because you only ever store hashes.

Where it actually runs

This is not a whiteboard structure. It's load-bearing in software you almost certainly depend on.

Cassandra keeps a Bloom filter per SSTable so a read can skip files that don't contain the partition; the default bloom_filter_fp_chance is 0.01, the off-heap memory is sized straight off the formula above. RocksDB and LevelDB do the same per SST file, and RocksDB has since shipped the Ribbon filter as a drop-in that runs about 30% smaller (roughly 7 bits per key at 1% instead of ~10) in exchange for more CPU at build time. Redis ships a native HyperLogLog and, since Redis 8, Bloom, Cuckoo, Count-Min, and Top-K as core types. BigQuery's APPROX_COUNT_DISTINCT and the HLL_COUNT.* functions are HyperLogLog under the hood, and the same machinery counts uniques in Google Analytics.

System	Structure	What it buys
Cassandra	Bloom per SSTable	Skips disk reads for absent partitions (default 1% FPR)
RocksDB / LevelDB	Bloom or Ribbon per SST file	Same, ~30% smaller with Ribbon
Redis	HyperLogLog + Bloom / Cuckoo / Count-Min	Cardinality and membership as core types
BigQuery	HyperLogLog++	`APPROX_COUNT_DISTINCT` without a full scan
Chrome (pre-2012)	Bloom of malicious-URL prefixes	Local Safe Browsing checks (since replaced)

The most instructive deployment is the one that got ripped out. Early Chrome kept an in-memory Bloom filter of malicious-URL prefixes for Safe Browsing, so the browser could check most URLs locally and only phone home on a "maybe." Around 2012 they pulled it, replacing it with a Golomb-compressed sorted prefix set that packed the same data tighter and behaved better under updates. That's the honest footnote to the whole genre: the trade is real, the trade has limits, and the limits are knowable. A Bloom filter is the right answer often, not always.

Fixing Bloom's three omissions

Bloom's 1970 design has three specific shortcomings, and almost the entire family tree below is people fixing one of them while paying for it somewhere else.

Loading diagram...

The first omission is deletion. A classic Bloom filter can't remove an item, because clearing its $k$ bits might clear a bit some other item still needs, and now you have a false negative, the one error that's supposed to be impossible. Counting Bloom filters fix it by replacing each bit with a small counter (about 4× the space). Cuckoo and Quotient filters fix it more cheaply by storing a short fingerprint of each item instead of a smear of bits, which you can match and delete exactly.

The second is cache locality. Bloom's $k$ hashes scatter across the whole array, so a single lookup can mean $k$ cache misses. Cuckoo filters confine each item to one of two buckets; Quotient filters keep everything contiguous and SSD-friendly. The Cuckoo trick is worth one line: an item's two candidate buckets are $i_{1} = h (x)$ and $i_{2} = i_{1} \oplus h (fingerprint)$ , so when you need to evict and rehome an item you can find its alternate bucket knowing only the fingerprint already sitting there.

The third is that 44% tax. XOR filters get it down to about 23% over the floor and Binary Fuse filters to about 13%, the current record. The catch is that both are static: you build them once from a known set and query forever, with no inserts afterward. Pick your filter by what you're willing to give up, not by the name you recognize.

Before moving on, it's worth watching a false positive actually happen, because the mechanism explains why false negatives can't. Step through two inserts and one unlucky query:

The other branch: counting without counting

Change the question and the same bargain reappears in a different shape. Instead of "is X in the set," ask "how many distinct things have I seen," or "how often have I seen this one." That's the streaming-sketch branch, and it's where the space savings stop being clever and start being absurd.

HyperLogLog answers count-distinct. The intuition is a party trick: hash every item to a random bitstring and watch for the longest run of leading zeros you ever see. A run of 10 leading zeros is a 1-in-1024 event, so seeing one suggests you've hashed on the order of 1000 distinct items. One observation is noisy, so HLL splits the stream across many registers and takes a (bias-corrected) harmonic mean. The standard error is set entirely by the register count $m$ :

$σ \approx \frac{1.04}{m} .$

Redis uses 16,384 registers, which lands the error at about 0.81%, in a flat 12 kilobytes that never grows, whether you're counting a thousand unique visitors or four billion. Counting four billion distinct items exactly would need gigabytes. HLL does it in the space of this paragraph rendered as a PNG, and the memory is pinned by your error target, not by the cardinality.

Count-Min Sketch answers frequency. It's a small $d \times w$ grid of counters; each item gets hashed into one cell per row and increments it, and the estimate for an item is the minimum across its rows. The minimum matters: collisions can only push a counter up, so the estimate is biased high and never low. Formally, with total stream mass $N$ , the error stays within $εN$ with high probability:

$Pr [\overset{a}{^}_{i} \leq a_{i} + εN] \geq 1 - δ .$

Same family resemblance as the membership branch: the error is one-sided (never an underestimate), it's bounded ( $εN$ ), and you buy down both $ε$ and the confidence $δ$ with grid size. You're pricing wrongness again, just in a different currency.

When the bounded error is a feature and the bounded leak is a bug

I've been selling this hard, so here's the part where it bites.

The failure mode of a probabilistic structure isn't always a false positive. Sometimes it's what the structure quietly reveals. Bitcoin's BIP37 is the cautionary tale: light wallets sent a Bloom filter of the addresses they cared about to full nodes, so they'd receive only matching transactions without naming their addresses outright. The bounded error worked exactly as designed. The problem was that the filter leaked enough structure that an observer could often reconstruct which addresses a wallet was watching, which is the precise thing the filter was meant to hide. The feature was sound; the side channel sank it. BIP37 is effectively deprecated.

There are quieter traps too. False positives compound: stack three 1% filters in a pipeline and the end-to-end miss rate is not 1% anymore. Static filters (XOR, Binary Fuse) are wonderful until someone asks you to add one element and you have to rebuild the whole thing. A mistuned $k$ silently inflates your false-positive rate with no error to catch it. And an adversary who can choose your inputs can drive collisions on purpose, turning your tidy 1% into something much worse.

None of this is an argument against the structures. It's an argument for treating them as a contract. When you reach for one you owe whoever's on call three things in writing: the error budget, the failure mode, and what a wrong answer costs the thing downstream. "Probabilistic" is fine. "Probabilistic, and nobody wrote down the budget" is how you get paged.

Where this leaves you

The reframe is the whole point. "Probabilistic" is a discount you negotiate, not a reliability problem you inherited. If you've been mentally filing these structures under "neat, but risky," here's what to do with them instead.

Reach for a filter when the common answer is "no." A negative oracle in front of an expensive lookup (a disk read, a network hop, a cold cache) is the highest-payoff use there is, and it's why every LSM-tree database ships one.
Pick by what you must give up, not by the name. Need deletes? Cuckoo or Quotient. Want the smallest possible static filter? Binary Fuse. Replacing a Bloom filter in an LSM tree? Ribbon. The comparison table above is the decision, not the trivia.
For "how many" and "how often," leave the membership branch entirely. HyperLogLog and Count-Min cost kilobytes where exact counting costs gigabytes, and the kilobytes are fixed by your error target, not your data size.
Write the error budget down. The false-positive rate, the failure mode, and the downstream cost of a wrong answer belong in the code comment next to the constructor, not in your head.
Check what the structure reveals, not just how often it's wrong. Bounded error is a feature; an information side channel is a bug. Ask BIP37.

Exact data structures answer the question you asked. Probabilistic ones answer a slightly cheaper question and hand you the price tag up front. The skill is knowing which question you can afford to ask slightly wrong.

Sources

Bloom, "Space/Time Trade-offs in Hash Coding with Allowable Errors" (CACM, 1970) — the founding paper; the original trade of exactness for space.
Fan, Andersen, Kaminsky, Mitzenmacher, "Cuckoo Filter: Practically Better Than Bloom" (CoNEXT, 2014) — deletion and partial-key hashing; wins under ~3% FPR.
Graf, Lemire, "Binary Fuse Filters" (ACM JEA, 2022) — ~13% over the information-theoretic floor, the current space record.
Flajolet, Fusy, Gandouet, Meunier, "HyperLogLog" (AofA, 2007) — cardinality estimation with $1.04/ m$ error.
Cormode, Muthukrishnan, "Count-Min Sketch" (J. Algorithms, 2005) — frequency estimation that never underestimates.
Bender et al., "Don't Thrash: How to Cache Your Hash on Flash" — Quotient filter (PVLDB, 2012) — mergeable, resizable, SSD-friendly membership.
RocksDB, "Ribbon Filter" (2021) — ~30% smaller than Bloom; a production drop-in.
Apache Cassandra docs — Bloom filters — per-SSTable filters, default 0.01 FPR.
Redis docs — probabilistic data types — HyperLogLog plus Bloom, Cuckoo, Count-Min, Top-K.
Chromium Safe Browsing design doc — the Bloom-filter-then-prefix-set migration.