What is a Bloom filter?

Answer

A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It may produce false positives (it says "element might be in the set" when it isn't) but never false negatives (if it says "not in set," the element definitely isn't).

How it works: a bit array of m bits (all initially 0) plus k independent hash functions.
Insertion: hash the element with each of the k functions → set those k bits to 1.
Lookup: hash with the same k functions → check those k bits. If all k bits are 1 → probably in set (false positive possible). If any bit is 0 → definitely not in set.

No deletion: clearing a bit could also erase evidence of other elements that hash to the same position, which would introduce false negatives. Use a Counting Bloom Filter (a small counter per position instead of a single bit) when deletions are needed.

Tuning: for a fixed number of inserted elements n, more bits (larger m) → fewer false positives. The number of hash functions k has an optimum of about (m/n) · ln 2: too few leaves the bit array underused, too many fills it with 1s, and every extra hash function adds work to each insertion and lookup. The false positive rate depends on m, k, and n. For a 1% false positive rate, budget about 10 bits per element (roughly 7 hash functions).

Applications: (1) Web crawler: check if a URL was already visited, to avoid recrawling; (2) Database query optimization: check if a value might exist before an expensive disk lookup; (3) CDN: check if content is cached at the edge; (4) Distributed caching: check if a key exists before fetching from the DB; (5) Chrome Safe Browsing: check if a URL is in the malicious list; (6) Cassandra, HBase: avoid disk reads for non-existent keys.

Space advantage: 1 million elements at a 1% false positive rate ≈ 1.2 MB, versus roughly 40 MB for a HashSet that stores the elements themselves (the exact figure depends on element size and runtime).
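The insertion and lookup steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the class name `BloomFilter` and the method names `add` and `might_contain` are hypothetical, and instead of k truly independent hash functions it assumes the common shortcut of deriving k positions from two halves of a single SHA-256 digest (double hashing).

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k bit positions per item."""

    def __init__(self, m: int, k: int) -> None:
        if m <= 0 or k <= 0:
            raise ValueError("m and k must be positive")
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # bit array, all bits start at 0

    def _positions(self, item: str):
        # Assumption: double hashing stands in for k independent functions.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, never zero
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item: str) -> None:
        # Insertion: set the k selected bits to 1.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # Lookup: any 0 bit means "definitely not in set";
        # all 1s means "probably in set".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


visited = BloomFilter(m=9_600, k=7)
visited.add("https://example.com/a")
print(visited.might_contain("https://example.com/a"))  # True: no false negatives
print(visited.might_contain("https://example.com/b"))  # False unless a false positive occurs
```

There is deliberately no `remove` method: clearing bits would risk false negatives for other items that share those positions.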
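The sizing figures quoted in the answer (about 10 bits per element and about 1.2 MB for 1 million elements at 1%) can be checked with the standard approximations m = -n · ln p / (ln 2)^2 and k = (m/n) · ln 2, which assume uniformly distributed, independent hash outputs. The helper names below are hypothetical and exist only for this check.

```python
import math


def optimal_bits(n: int, p: float) -> int:
    """Bits needed for n items at target false positive rate p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def optimal_hashes(m: int, n: int) -> int:
    """Hash count that minimises the false positive rate for given m and n."""
    return max(1, round(m / n * math.log(2)))


def false_positive_rate(m: int, k: int, n: int) -> float:
    """Standard approximation: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


n, p = 1_000_000, 0.01
m = optimal_bits(n, p)
k = optimal_hashes(m, n)

print(f"bits per element: {m / n:.2f}")       # about 9.6
print(f"filter size: {m / 8 / 1e6:.2f} MB")   # about 1.2 MB
print(f"hash functions: {k}")                 # 7
print(f"estimated false positive rate: {false_positive_rate(m, k, n):.4f}")
```

Because k must be a whole number, the estimated rate lands very slightly above the 1% target rather than exactly on it.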
