Saturday, August 10, 2013

Hash Tables

A hash table is primarily used to keep track of an evolving set of stuff by using a key. It is good for inserting, deleting, and lookups. It is also sometimes referred to as a "dictionary."

They are similar to array in a way, that array gives information at the index very fast, if you know the index. Suppose, your friend names are integer, and you save the number in the array. If you need phone number of your friend 173, you can get it as array[173]. But the friend names are strings like Alice, bob, charlie, etc So, we have to have an array with random access to the array with index as strings and not integer.

Operations 
  • Insert
  • Delete
  • Lookup (very important)
The operations that it supports are all blazingly fast, O(1) running time. However, this is only if it is implemented properly.

Note: A hash table should not be used if you need to keep things in some particular order.
So, when sometimes, hashtable is called dictionary, don't assume any order, its like misnomer.
 
2 caveats:
  • Hashtable can be easily implemented badly - So, its better to use some standard library.
    Hence, if you have not implemented well, this O(1) guarantee will not work. But, if you are forced to come up with your own hashtable, then you will get O(1) guarantee, only if you implement well.
  • Unlike, most of the problems in data structure, hashtable doesn't enjoy worst case guarantee. So, you can't guarantee that searching for any data will guarantee any worst case guarantee.

Applications
De-duplication OR Removing the duplicates: You are given a stream of object, such as a linear scan through a large file or objects arriving in real time. The goal is to remove duplicates (only keep track of unique objects). This could be  used to report only unique visitors to a website and avoid duplicates in search results.

When a new object x arrives, look it up in hash table H. If it is not found, then insert x into H.

The 2-SUM Problem: You are given an array of n integers and a target sum T. The goal is to determine whether or not there are two numbers x,y in A with x+y=T.
(I have already mentioned this problem here, but showing here in summary for application of stack)

Approach 1:
The naive way to do this would be to check all combinations (n choose 2). This exhaustive search is O(n2).

Approach 2: 
 A better way would be to sort the array. This takes O(n log n)
Then for each x in array A, use binary search to look for T-x. This will take O(nlogn).
So, overall search is  O(n log n)

Approach 3 :
The best way would be to insert every element into a hash table (without sorting). This takes O(n) as constant time insertion.
Then for every x, we can just look up its complement, T-x, which is O(1).
Overall it takes will be O(n).

Other application

  • historical application - Hash tables was first designed for and used to symbol table lookups for a compiler. 
  • It can also be used to block network traffic. If the IP is in blacklist, just drop the packet and don't allow to connect.
  • Speeding up search algorithms like in Game theory or Chess. (There are more chess configurations than size of web). You can add a configuration to a hash table to avoid exploring it more than once. 
How to Implement HashTables?
 So how is a hash table implemented?
Setup - Universe U (e.g. all possible IP address, or all possible names or all possible mobile numbers or all possible chess board configuration). You might have noticed that universe U is very large.

Goal - Set S is of reasonable size S⊆ U. So, this set S is of reasonable size, like you keeping phone number of your friends.

Selecting data-structure for implementation
Without data-structure, we will not get good solution. We could use an array for quick insert, deletion, and lookup, but it will take up a lot of space. Also, your friend name should be integer and space requirement is proportional to universe.

On the other hand, we could use a linked list. This would only take up as much as space as there are objects i.e. Set S, but the 3 operations would not be O(1). To resolve this, we can use both.

So, the solution is to use the best of both worlds, i.e. fast lookup of arrays and small storage of size like link list.


Solution 1 :
The first way to implement a hash table could be to use an array of linked lists. Now, the array size will not be the size of universe, but the size proportional to S, the set we are storing. So, let array size be n which is proportional to the size of the S. For the time being lets assume that the size of set S, doesn't fluctuates.
(Note: If size of S increases, we have to re-allocate the array to keep our data structure and vice versa for shrinking of set S)

We will be calling each entry of array as bucket. So, we have n buckets, where
n ∝ S i.e. n = kS where k is some constant.

Now we need to translate between the things we care about like names of the friends to the position in the array.  This is achieved by hash function.
Formal definition of hash function - So, hash function takes as input a key like IP address OR name of your friend and returns the position in this array. This is not specified how you want to implement this function.

If, suppose we have friend Alice,
hash(Alice) = 17
This implies, we have to save Alice has to be stored at the 17th bucket(or position) of the array.


We use a hash function to determine which index the object goes into. If there is already something in that index, this is called a "collision" .
 Suppose, hash(Alice) = 17, and hash(Bob) = 17, then how do we solve this.

How often can collision occur?
 The "Birthday Paradox" states that given 23 people, there is a 50% chance that two people will share the same birthday. By this principle, we can see that it doesn't take relatively many objects to cause a collision. See here for more on birthday paradox.

Resolving Collisions - Using Chaining

For x,y  ∈ U, if h(x)=h(y)
So, this U is very large, and hash function is to somehow map all this universe with buckets n in array A:
Separate Chaining
To resolve this, we just make a chain at that index of the array. So we keep linked list in each bucket
given a key/objectx, perfrom insert/delete/lookup in the list in A[h(x)].

For eg., Suppose we have 4 buckets, in the first bucket we have A (Say alice), and in bucket 3 we have B(say Bob) and D(say Daniel) as there was collision between B.
Open Addressing

The second way to implement a hash table would be to use open addressing. Though it is practically hard to explain mathematically, but is very important solution. Here we don't use pointer or list at all.
You will have 1 object per bucket. So, if you have inserted Bob at bucket, and Daniel collides, you will again probe the hash function to get the next position. So, hash function becomes hash sequence.

When there is a collision, simply put it elsewhere. This could be the next available index, which is called linear probing. So, you go on produce hash until you find next open slot.

Another we solution we have double probing - we have 2 use hash function h1(x) and h2(x). Suppose h1(Daniel) = 17, and h2(Daniel) = 23. If position 17  is empty, you insert Daniel there. But if it is full, then you h2(Daniel) will act as an additive offset So if position 17 is empty, you will place Daniel at 17+23 = 40 th position. Likewise, if 40 is also full, the look for position 40+23 i.e. 63 and so on.

Which solution for collision resolving is better?
If space is problem, then use probing, because chaining uses more space
Deletion is more trickier for probing, and then you can use chaining. 


Implementing Good Hash function
We will discuss about hashtable with chain. Insertion in the hashtable is O(1).

Lookup
Invoke hash function and see where is our bucket.
Now go to the bucket and traverse the list until we find our element.

Suppose we have 100 elements which are inserted in the hash function, then we have
Lucky hash function - 1 element per bucket
Stupid hash function - all elements in in a[0].

In case of chaining, the lookup time will depend on the link list, and hence depend on the quality of hash function.

In case of open addressing implementation, the time complexity depend on the probe sequence and hence the implementation of hash function.

So, performance of hash function depends on the quality of hash function, no matter what implementation we use.

Properties of Good hash function
  1. Should lead to good performance. So, in case of chain implementation, the length of the chain should be equal. So, we should spread data out. So, the gold standard for spreading data out is completely random function.
    Note - Its also important to remember what is our hash function is. So, when we want to lookup for the element, it will again generate hash and hence we will not be easily. We have to store somewhere that we stored Alice there. So, completely random hash function is not possible.
  2. Should be easy to store and very fast to evaluate. 
Bad Hash Functions
Bad hash function - Example 1
keys = phone numbers (10 digits)
Size of universe = |U| = 1010. Suppose we won't have more than 500 friends or phone numbers which we have to track, so our |S| = 500. So, we choose 1000 buckets, just doubling it. So, n=1000. So our hash function should take mobile number and produce 3 digit hash function.

terrible hash function
h(x) = floor(x/1000 )

So, suppose we use 3 most significant digits of mobile number given the number, and suppose first 3 digits in the mobile number reflects the operator or area code, which is normally the case. Then this is a terrible idea.

mediocre hash function

h(x) = x mod (1000)

So, suppose now we use last 3 digits as the hash to the key provided, it is still better and their is no evidence that last 3 digits are operator or area related, and hence still better. But still collision is possible, as there may be some pattern which will make the hash function vulnerable.

Bad hash function - Example 2 
key = Memory location of the object
In case of memory location it is in power of 2. Our goal is to take big integer which is power of 2 and producing a good hash.
keys = Memory location, and have power of 2 and hence even.

terrible hash function -
h(x) = x mod (1000)

This hash function is incapable of producing odd hash, and hence half of the buckets are wasted.
So, the mediocre hash function in  example 1, is terrible here.

Good hash function
Answering what is good hash function is quite tricky. So, what we discuss now is okay-ish hash function and not good. For good hash function we have to study much more that :)


Suppose the object type is string, like name. So, our first requirement is to get the integer or hashcode from the object
int GetHashCode(string s)
{
     int sum = 0;
     foreach(Char c in s)
     {
       int x = ascii(c) // get ascii value of char c   
       sum += x;
     }
     return sum;
}

Once you convert the object to integer, then we have to use some compression function like mod (n) where n is the size of the hash table.

int Compress(int num, int size)
{
     return num mod (size); //mod is the the function which 
                            //returns remainder after dividing number num by number size.
}

So overall we do following:
GetBucketIndex(object o)
{
     int hashCode = GetHashCode(o);
     int compressedValue = Compress(hashCode, sizeOFHashTable);
     return compressedValue;
}

So, this gives us the index of the object o, where it has to be inserted.

how to choose the n - number of buckets ?

  1. Choose n to be a prime (within constant factor of number of objects in table)
  2. Prime should not be close to the data. like power of 2 or power of 10.
Note that there is no single solution to good function. 
Here is how hashCode() method is implemented in java for strings. Thanks.


0 comments:

Post a Comment