Thursday 30 May 2024

Bloom Filters: The Speedy Way to Confirm Element Absence

 A Bloom filter is a clever way to check if something is part of a group or not, like checking if a given word is in a list of words or not. But here's the twist, If it says something isn't in the list, it's definitely not there. But if it says something might be in the list, it's not completely sure.

 

Initialization of Bloom Filter

A bloom filter is designed with an array of bits, and k hash functions, where all the bits are initialized with 0. Let's take a Java Bitset of size 10, and three function h1, h2 and h3 to demonstrate this example.

 

Let's understand the process of incorporating the term 'dog' into this Bloom filter. We use three hash functions along with a bit array of 10 elements, initially all set to zero. The hash values for the string 'dog' looks like below.

h1(dog) % 10 = 5
h2(dog) % 10 = 3
h3(dog) % 10 = 8


 Let's set the bits at the hash function outcomes to 1 (in this case, bitset or bit array at the index 5, 3 and 8 is set to 1).

Let's add another item "Apple" into the Bloom Filter, we'll do some calculations to find out where it fits:

h1("Apple") % 10 = 7
h2("Apple") % 10 = 9
h3("Apple") % 10 = 1


 
in this case, bitset or bit array at the index 7, 9 and 1 is set to 1

How do we verify if something absence the Bloom filter?

To check if "Banana" is in the Bloom Filter, we calculate its hashes.

h1("Banana") % 10 = 4
h2("Banana") % 10 = 7
h3("Banana") % 10 = 3

After calculating these hashes, we examine whether all the corresponding index positions are set to 1 or not.



In this example, the values at indexes 3 and 7 are set to 1, while the value at index 4 is set to 0. So, we can confirm that this element is not present in the Bloom Filter.

False Positives problem in Bloom Filter

A Bloom filter can confidently state the absence of an element, but it cannot guarantee the presence of an element with 100% confidence. Consider another scenario to illustrate this point, where we're checking for the existence of the data 'Cat'. The respective hashes are as follows.

h1("Cat") % 10 = 1
h2("Cat") % 10 = 7
h3("Cat") % 10 = 5

Upon examining the bit array, we find that the bits at these indices are set to 1. However, it's important to note that "Cat" was never added to the filter. The bits at index 1 and 7 were set when we added "Apple", and the bit at index 5 was set when we added "dog".



How to reduce the impact of False Positives?

By using a larger bit array and more hash functions, we can decrease the impact of more false positives. However, it's important to maintain the balance between the size of the Bloom filter and the number of hash functions used. If there are more hash functions, it takes more time to insert into Bloom Filter.


                                                                                System Design Questions

No comments:

Post a Comment