Saturday, January 7, 2012

Bloom filters

The bloom filter is a data structure designed in 1970. It is similar to a hash table. They are more efficient than a hash table, but they do raise false positives at times.


Supported Operations
A bloom filter supports super fast lookup and insertion, just like a hash table. So why have this data structure?

Pros
  1. It is much more space efficient.
    It takes the space even less than that of the object. So, we don't even worry about what the object is, but we only care about whether we have seen that object or not. So, you may have used something called HashSet in java.
Cons
  1. You can't store an associated object. You are only storing a value to indicate whether or not an object is in the data structure. 
  2. Deletions are also not allowed.  (by the way it is still possible to delete, but this point still holds for vanilla or simple bloom filter)
  3. There is also a small false positive probability. This means that even if you haven't inserted object x in the bloom filter, it will say x has been inserted.
Applications
  1. Early spellcheckers (of 1970s) - you add every possible word into the filter. Then you just check word by word if something is in the filter or not. If it is not in the filter, then that is an incorrect word. With small probability, this will say that an incorrectly spelled word is actually correct.
  2. List of forbidden passwords (still relevant) - This could be used to prevent the use of common or too-easily guessed passwords. 
  3. Software used in Network routers (modern) - There is limited memory and it needs a super-fast data structure, which can manage the packets coming at torrential rate, keeping track of denial of service attack, keeping track of banned ip address. 
When to use Bloom filters?
They are light weight than hash table, but they have some probability of false positive error.
So, they can be used where speed and size is the constraint, but little amount of error is allowed.

Implementing Bloom filters
Array of n bits
Like hashtable, they need to use array for fast access. So they use an array and couple of hash functions.

To implement this data structure, you use an array of n bits (bits can be either 0 or 1) . The space occupied by the bloom filter is in terms of number of bits per object that have been inserted in the bloom filter. So, if you have inserted dataset S, and number of bits inserted in the filter is n, then size of the bloom filter or size per object = n / |S|
|S| - size of the set

For eg. if the ratio is 8, then the bloom filter is using the 8 bits per object, which is awesome.(way less than even the storing the pointer of the object)

k hash functions

We also need k has functions, somewhere around 2-5 (You could also make 2 hash functions and make k linear combinations of those 2 functions).

Insert(x)
Insert(x)
{
   for i=1,2,...,k 
      set A[hi(x)]=1
}

So, we set these k positions in array of bits equal to 1.

Lookup(x)
Then for lookup, you just return true if A[hi(x)]==1 for i=1,2,...,k.
Here, you can't get a false negative since we never return an entry in the array back to 0.
However, you can get a false positive if other insertions set all k hi(x) to 1.

For this data structure to be good, we need to see that the probability of a false positive is relatively low, while keeping the space needed small. So, we have to find the trade off curve between the 2.


Implementation in CPP

Bloom filters are very simple. It is a big array of bits, initially all zero.

char *bitarray;
bitarray =malloc(100000000*sizeof(char));
for (i = 0; i < MAXSIZE; i++)
bitarray[i] = 0;


There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution. Let us look at how an element is added in to a bloomfilter. Here i used two different hash function namely, hash1 and hash2. Suppose we are going to represent a dictionary of words using bloom filter, first we will feed each word in to two hash functions to get two array positions which is used to set the corresponding bit in the array of bits. Sometimes there’ll be clashes, where the bit will already be set from some other word. This doesn’t matter.
insert function look like this.
void insert()
{
        char word[ARRSIZE];
        unsigned int h1, h2;

        scanf("%s", word);
        h1 = Hash1(word, strlen(word));
        add(h1);
        h2 = Hash2(word, strlen(word));
        add(h2);
}


Here insert function uses hash1 and hash2 hash functions to produce two array positions. The function add set the corresponding bit in the array of bits. The add function look like this.

#define BITSIZE 8
void add(unsigned int h)
{
        bitarray[h / BITSIZE] |= (1 << (7 - h % BITSIZE));
}

The hash1 and hash2 function look like this.
unsigned int Hash1(char *str, unsigned int len)
{
        unsigned int b    = 1;
        unsigned int a    = 1;
        unsigned int hash = 0;
        unsigned int i    = 0;

       for(i = 0; i < len; str++, i++){
              hash = hash * a + (*str);
              a    = a * b;
        }
        return hash;
}



hash2:
unsigned int Hash2(char* str, unsigned int len)
{
       unsigned int seed = 13;
       unsigned int hash = 0;
       unsigned int i    = 0;

       for(i = 0; i < len; str++, i++){
              hash = (hash * seed) + (*str);
       }
       return hash;
}

Next how to check an element is in the set? first feed the element in to two hash functions to get two array positions. If any of the bits at these positions are
0, the element is not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set,
or the bits have been set to 1 during the insertion of other elements(The Bloom filter reports a false positive).
void lookup()
{
        unsigned int h1, h2;
        int true1, true2;
        char word[ARRSIZE];

        scanf("%s", word);
        h1 = Hash1(word, strlen(word));
        true1 = bitarray[h1 / BITSIZE] & (1 << (7 - h1 % BITSIZE));
        h2 = Hash2(word, strlen(word));
        true2 = bitarray[h2 / BITSIZE] & (1 << (7 - h2 % BITSIZE));
        if (true1 && true2)
                printf("%s is in the set\n", word);
        else
               printf("%s is not in the set\n", word);

}


Thanks

0 comments:

Post a Comment