Bloom Filter

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.

For example, checking the availability of a username is set membership problem, where the set is the list of all registered usernames.

Algorithm
- Each item hashed to multiple indexes. If all indexes are set the element is present. If not then certainly not present.
Probabilistic data structures will give you memory-efficient, faster results with a cost of providing a probable result instead of a certain one
Suitable for data that needs no deletion
A Bloom filter is an inexact representation of a set that allows for false positives when queried; that is, it can sometimes say that an element is in the set when it is not
Bloom filter can check for if a value is ‘possibly in the set’ or ‘definitely not in the set
You might already understand that if the size of the bloom filter is too small, soon enough all of the bit fields will turn into ‘1’ and then our bloom filter will return ‘false positive’ for every input. So, the size of the bloom filter is a very important decision to be made.
A larger filter will have fewer false positives and a smaller one more. So, we can tune our bloom filter to how precise we need it to be based on the ‘false positive error rate’
False positive result
- Good
- Can recheck
False-negative
- Bad
- No recheck
Counting bloom filter
- Instead of 1 required count

We can calculate the false positive error rate, p, based on the size of the filter, m, the number of hash functions, k, and the number of elements inserted, n, with the formula:

(1 - e ^ - (k * n) / m ) ^ k

Usage

Apache Cassandra uses SSTable
Used in Postgres for query optimization
To skip the recommendations that are already served to you, bloom filters are used.
Avoid caching the items that are very rarely searched or searched only once. Only when they are searched more than once, they will get cached.

HyperLogLog

Approximation: Number of zeros at the end, N then 2^N
Use bucket, 70% average.

References

Bloom Filter
Murmur Hash Function
FNV Hash Function
HashMix Hash Function

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

probabilistic algorithm.md

probabilistic algorithm.md

Bloom Filter

Usage

HyperLogLog

References

Files

probabilistic algorithm.md

Latest commit

History

probabilistic algorithm.md

File metadata and controls

Bloom Filter

Usage

HyperLogLog

References