A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set.
For example, checking the availability of a username is set membership problem, where the set is the list of all registered usernames.
-
Algorithm
- Each item hashed to multiple indexes. If all indexes are set the element is present. If not then certainly not present.
-
Probabilistic data structures will give you memory-efficient, faster results with a cost of providing a
probable
result instead of acertain
one -
Suitable for data that needs no deletion
-
A Bloom filter is an inexact representation of a set that allows for false positives when queried; that is, it can sometimes say that an element is in the set when it is not
-
Bloom filter can check for if a value is ‘possibly in the set’ or ‘definitely not in the set
-
You might already understand that if the size of the bloom filter is too small, soon enough all of the bit fields will turn into ‘1’ and then our bloom filter will return ‘false positive’ for every input. So, the size of the bloom filter is a very important decision to be made.
-
A larger filter will have fewer false positives and a smaller one more. So, we can tune our bloom filter to how precise we need it to be based on the ‘false positive error rate’
-
False positive result
- Good
- Can recheck
-
False-negative
- Bad
- No recheck
-
Counting bloom filter
- Instead of 1 required count
We can calculate the false positive error rate, p, based on the size of the filter, m, the number of hash functions, k, and the number of elements inserted, n, with the formula:
(1 - e ^ - (k * n) / m ) ^ k
- Apache Cassandra uses SSTable
- Used in Postgres for query optimization
- To skip the recommendations that are already served to you, bloom filters are used.
- Avoid caching the items that are very rarely searched or searched only once. Only when they are searched more than once, they will get cached.
- Approximation: Number of zeros at the end, N then 2^N
- Use bucket, 70% average.