What is the advantage to using bloom filters?
From Wikipedia:
Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception, since they can share storage between elements with equal prefixes). Linked structures incur an additional linear space overhead for pointers. A Bloom filter with 1% error and an optimal value of k, on the other hand, requires only about 9.6 bits per element — regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. If a 1% false positive rate seems too high, each time we add about 4.8 bits per element we decrease it by ten times.
Pretty clear to me.
A bloom filter doesn't store the elements themselves, this is the crucial point. You don't use a bloom filter to test if an element is present, you use it to test whether it's certainly not present, since it guarantees no false negatives. This lets you not do extra work for elements that don't exist in a set (such as disk IO to look them up).
And all in significantly less space than something like a hash table (which is likely going to be partially on disk for large data sets). Though you may use a bloom filter in conjunction with a structure like a hash table, once you're certain the element has a chance of being present.
So an example usage pattern might be:
You've got a lot of data, on disk -- you decide on what error bound you want (e.g. 1%), that prescribes the value of m. Then the optimal k is determined (from the formula given in the article). You populate your filter from this disk-bound data once.
Now you have the filter in RAM. When you need to process some element, you query your filter to see if it stands a chance of existing in your data set. If it doesn't, no extra work is done. No disk reads, etc. (Which you would have to do if it were a hash or tree, etc).
Otherwise, if the filter says "Yes, it's in there", there's a 1% chance that it's wrong, so you do the necessary work to find out. 99% of the time, it really will be there, so the work was not for naught.
I will start with the explanation of what is a bloom filter, what it can and can't do, why do we need it, show an intuitive description how it works and then give some example when they can be useful.
So a standard bloom filter is a probabilistic data structure that can*:
- add element to a set
- check if an element is in the set by telling
definitely not in the set
orpossibly in the set
This possibly in the set
is exactly why it is called probabilistic. Using smart words it means that false positive are possible (there can be cases where it falsely thinks that the element is positive) but false negative are impossible.
But it can't *:
- remove an item from the set
- give you a list of all elements that are currently in your set
*This set of can/can't is for a basic bloom filter. Because it is a useful data structure that was created long time ago, people found how to augment it with other useful features.
But wait a minute: we already know a data structure that can answer all of this without vague 'possible' and also without all of the limitations (can't remove, can't show all). And it is called a set. And here comes a main advantage of a bloom filter: it is space efficient and space constant.
This means that it does not matter how many elements do we store there, the space will be the same. Yes a bloom filter with 10^6
elements (useless bloom filter) will take the same amount of space as a bloom filter with 10^20
elements and the same space as bloom filter with 0
elements. So how much space will it take? It is up to you to decide (but there is a trade of: the more elements you have the more uncertain you are with you possible in the set
answer.
Another cool thing is that it is space constant. When you save the data to a set, you have to actually save this data. So if you store this long string in the set
you have to at least use 27 bytes of space. But for a 1% error and an optimal value of k **, you will need ~ 9.6 bits ( < 2 bytes) per any element (whether it is a short int or a huge wall of text).
Another property is that all the operations are taking constant time, which is absolutely not the same as amortized constant time in case of sets (remember that if the set has collisions, it can deteriorate in O(n)
time).
**k is a value of hash functions used in the bloom filter
I will not describe how the bloom filters work (wikipedia article does a very good job explaining everything). Here I will just briefly tell the basics.
- you initiate an empty bit array of length
m
- you select
k
different hash functions (the more independent the better) - if you want to add element, you calculate all the
k
hashes of this value and set the corresponding bits to 1 - if you want to check if element exist, you also calculate all
k
hashes and if at least one of them is not set, it is surely is not in the set. Otherwise it can be in the set.
Even this description is enough to understand why we can't be sure (you can get all bits set from various other values). Here is a very nice visualization of how it works.
So when can bloom filters be useful? The short answer is everywhere where false positive are acceptable and where you would want to check if something is in the set, but even if they are not, it can be a first line of defense to rule out expensive calls to verifiers.
Here is a list of more concrete descriptions:
- a standard example of malicious websites and a browser is described in almost any place where people talk about bloom filters
- is a passwords weak: instead of having a huge set of all possible weak passwords, you can just check whether the password is surely not weak with a way smaller bloom filter
- if you have a list of articles and a list of users, you can use bloom filter to show users' articles they have not read. Interesting thing is that you can have only one filter (you check whether the combination of user_id + article_id is there)
- bitcoin uses bloom filter for wallet synchronization
- Akamai's web servers use Bloom filters to prevent "one-hit-wonders" from being stored in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates (taken from examples in bloom's filter article at wiki)
Alex has explained it pretty well. For those who still did not get quite a grasp on it, hopefully this example will help you understand:
Lets say I work for Google, in the Chrome team, and I want to add a feature to the browser which notifies the user if the url he has entered is a malicious URL. So I have a dataset of about 1 million malicious URLs, the size of this file being around 25MB. Since the size is quite big, (big in comparison to the size of the browser itself), I store this data on a remote server.
Case 1 : I use a hash function with a hash table. I decide on an efficient hashing function, and run all the 1 million urls through the hashing function to get hash keys. I then make a hash table (an array), where the hash key would give me the index to place that URL. So now once I have hashed and filled the hashing table, I check its size. I have stored all 1 million URLs in the hash table along with their keys. So the size is at least 25 MB. This hash table, due to its size will be stored on a remote server. When a user comes along and enters a URL in the address bar, I need to check if it malicious. Thus I run the URL through the hash function (the browser itself can do this) and I get a hash key for that URL. I now have to make a request to my remote server with that hash key, to check the if the particular URL in my hash table with that particular key, is the same as what the user has entered. If yes then it is malicious and if no then it is not malicious. Thus every time the user enters a URL, a request to the remote server has to be made to check if it is a malicious URL. This would take a lot of time and thus make my browser slow.
Case 2 : I use a bloom filter. The entire list of 1 million URLs are run through the bloom filter using multiple hash functions and the respective positions are marked as 1, in a huge array of 0s. Let's say we want a false positive rate of 1%, using a bloom filter calculator (http://hur.st/bloomfilter?n=1000000&p=0.01) , we get the size of the bloom filter required as only 1.13 MB. This small size is expected as, even though the size of the array is huge, we are only storing 1s or 0s and not the URLs as in case of the hash table.This array can be treated as a bit array. That is, since we have only two values 1 and 0, we can set individual bits instead of bytes. This would reduce the space taken by 8 times. This 1.13 MB bloom filter, due to its small size, can be stored in the web browser itself !! Thus when a user comes along and enters a URL, we simply apply the required hash functions (in the browser itself), and check all the positions in the bloom filter (which is stored in the browser). A value of 0 in any of the positions tells us that this URL is DEFINITELY NOT in the list of malicious URLs and the user can proceed freely. Thus we did not make a call to the server and hence saved time. A value of 1 tells us that the URL MIGHT be in the list of malicious URLs. In these cases we make a call to the remote server and over there we can use some other hash function with some hash table as in the first case to retrieve and check if the URL is actually present. Since most of the times, a URL is not likely to be a malicious one, the small bloom filter in the browser figures that out and hence saves time by avoiding calls to the remote server. Only in some cases, if the bloom filter tells us that the URL MIGHT be malicious, only in those cases we make a call to the server. That 'MIGHT' is 99% right.
So by using a small bloom filter in the browser, we have saved a lot of time as we do not need to make server calls for every URL entered.
We can see that hash table with a single hash function is used for a different purpose altogether than a bloom filter. Hopefully this clears your doubts :)
edit:
I have implemented a bloom filter for the task of malicious URL testing in Python. The code can be found here - https://github.com/tarunsharma1/Bloom-Filter The code is very simple to understand and a detailed description is provided in the readme file.