Friday, May 2, 2014

Hash Functions


Importance of Hash function
Why Hash-function matter?

Chaining implementation of hashtable
We will discuss about hashtable with chain. Insertion in the hashtable is O(1), so does lookup. 


Invoke hash function and see where is our bucket. Now go to the bucket and traverse the list until we find our element.

Suppose we have 100 elements which are inserted in the hash function, then we have
Lucky hash function - 1 element per bucket
Stupid hash function - all elements in in a[0].


In case of chaining, the lookup time will depend on the link list, and hence depend on the quality of hash function.

Open Addressing implementation

In case of open addressing implementation, the time complexity depend on the probe sequence and hence the implementation of hash function.

Conclusion
So, performance of hash function depends on the quality of hash function, no matter what implementation we use.

Properties of Good Hash function
  1. Should lead to good performance. So, in case of chain implementation, the length of the chain should be equal. So, we should spread data out. So, the gold standard for spreading data out is completely random function.
    Note - Its also important to remember what is our hash function is. So, when we want to look-up for the element, it will again generate hash and hence we will not be easily. We have to store somewhere that we stored Alice there. So, completely random hash function is not possible.
  2. Should be easy to store and very fast to evaluate.So it cannot be random. If suppose we hash for Alice is 39 randomly while insertion, while looking up, it will generate some different hash, and hence we cannot lookup. 
Bad Hash Functions
Bad hash function - Example 1
keys = phone numbers (10 digits)
Size of universe = |U| = 1010. Suppose we won't have more than 500 friends or phone numbers which we have to track, so our |S| = 500. So, we choose 1000 buckets, just doubling it. So, n=1000. So our hash function should take mobile number and produce 3 digit hash function.

terrible hash function
h(x) = floor(x/1000 )

So, suppose we use 3 most significant digits of mobile number given the number, and suppose first 3 digits in the mobile number reflects the operator or area code, which is normally the case. Then this is a terrible idea.

mediocre hash function

h(x) = x mod (1000)

So, suppose now we use last 3 digits as the hash to the key provided, it is still better and their is no evidence that last 3 digits are operator or area related, and hence still better. But still collision is possible, as there may be some pattern which will make the hash function vulnerable.

Bad hash function - Example 2 
key = Memory location of the object
In case of memory location it is in power of 2. Our goal is to take big integer which is power of 2 and producing a good hash.
keys = Memory location, and have power of 2 and hence even.

terrible hash function -
h(x) = x mod (1000)

This hash function is incapable of producing odd hash, and hence half of the buckets are wasted.
So, the mediocre hash function in  example 1, is terrible here.

Good hash function
Answering what is good hash function is quite tricky. So, what we discuss now is okay-ish hash function and not good. For good hash function we have to study much more that :)


Suppose the object type is string, like name. So, our first requirement is to get the integer or hashcode from the object
int GetHashCode(string s)
{
     int sum = 0;
     foreach(Char c in s)
     {
       int x = ascii(c) // get ascii value of char c   
       sum += x;
     }
     return sum;
}

Once you convert the object to integer, then we have to use some compression function like mod (n) where n is the size of the hash table.

int Compress(int num, int size)
{
     return num mod (size); //mod is the the function which 
                            //returns remainder after dividing number num by number size.
}

So overall we do following:
GetBucketIndex(object o)
{
     int hashCode = GetHashCode(o);
     int compressedValue = Compress(hashCode, sizeOFHashTable);
     return compressedValue;
}

So, this gives us the index of the object o, where it has to be inserted.

how to choose the n - number of buckets ?

  1. Choose n to be a prime (within constant factor of number of objects in table)
  2. Prime should not be close to the data. like power of 2 or power of 10.
Note that there is no single solution to good function. 
Here is how hashCode() method is implemented in java for strings.

Summary

As Prof Tim Roughgarden puts it, every Super man has its Kryptonite, every hash function will have its pathogen. I will like to take this answer here, which reiterates what Prof said:
There's no such thing as a “good hash function” for universal hashes (ed. yes, I know there's such a thing as “universal hashing” but that's not what I meant). Depending on the context different criteria determine the quality of a hash.Cryptographic hash functions like SHA aren't at all good for hash tables which you probably mean.

Hash tables have very different requirements. But still, finding a good hash function universally is hard because different data types expose different information that can be hashed. As a rule of thumb it is good to consider all information a type holds equally. This is not always easy or even possible. For reasons of statistics (and hence collision), it is also important to generate a good spread over the problem space, i.e. all possible objects. This means that when hashing numbers between 100 and 1050 it's no good to let the most significant digit play a big part in the hash because for ~ 90% of the objects, this digit will be 0. It's far more important to let the last three digits determine the hash.
Similarly, when hashing strings it's important to consider all characters – except when it's known in advance that the first three characters of all strings will be the same; considering these then is a waste.
This is actually one of the cases where I advise to read what Knuth has to say in The Art of Computer Programming, vol. 3. Another good read is Julienne Walker's The Art of Hashing.


Reference


0 comments:

Post a Comment