Sunday, August 11, 2013

Pathological Data Sets and Universal Hashing Motivation

We will go deep inside hash functions, where we will find that every hash table has its Kryptonite: some pathological data set that makes it vulnerable.
Note: Before continuing with this article, please first read the article - Hash Tables.

The Load factor or Load of a hash table
It tells us how populated the hash table is. It is denoted by alpha (α):

α = (number of objects in the hash table) / (number of buckets in the hash table)

The load of a hash table is the total number of elements in the table divided by the total number of buckets. In a chained implementation, the number of elements can exceed the number of buckets, so α > 1 is possible; with open addressing, each bucket holds at most one element, so α ≤ 1 always.
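
As a quick illustration, here is a minimal sketch of a chained hash table and its load factor. The class and method names are my own for illustration; they are not from the article.

```python
# Minimal sketch: a chained hash table and its load factor.

class ChainedHashTable:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]  # one chain per bucket
        self.size = 0                                    # objects stored so far

    def load_factor(self):
        # alpha = (number of objects) / (number of buckets)
        return self.size / len(self.buckets)

    def insert(self, key, value):
        self.buckets[hash(key) % len(self.buckets)].append((key, value))
        self.size += 1

table = ChainedHashTable(num_buckets=4)
for k in range(10):
    table.insert(k, str(k))
print(table.load_factor())  # 10 / 4 = 2.5 -- with chaining, alpha > 1 is fine
```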

Note:
  1. α = O(1) is necessary for operations to run in constant time in a chained implementation.
    With open addressing, we need α << 1. Either way, for good performance we need to control the load.
  2. Secondly, for good performance we need a good hash function, i.e., one that spreads the data evenly across buckets. However, a hash function that spreads out every data set does not exist. That is, for every hash function, there is a pathological data set.
We don't know what data clients will put in, but we do control the denominator, i.e., the number of buckets in the hash table. So, whenever the load crosses some threshold, we increase the number of buckets (rehashing the existing elements) and thereby keep the load under control, as sketched below.
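
Continuing the sketch above, one simple policy (the 0.75 threshold and the doubling are illustrative choices, not prescribed by the article) is to double the bucket array and rehash whenever α crosses a threshold:

```python
MAX_LOAD = 0.75  # illustrative threshold; any constant bound keeps alpha = O(1)

def insert_with_resize(table, key, value):
    table.insert(key, value)
    if table.load_factor() > MAX_LOAD:
        old_buckets = table.buckets
        table.buckets = [[] for _ in range(2 * len(old_buckets))]
        table.size = 0
        for chain in old_buckets:   # rehash every element into the
            for k, v in chain:      # doubled bucket array
                table.insert(k, v)
```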

A universally great hash function? No, no
Can we design a super-clever hash function that is guaranteed to spread every data set out evenly?

The answer is no. No matter what function you write, there is always some data set to which it is vulnerable. The reason is the pigeonhole principle: the universe of possible keys is far larger than the number of buckets n, so for any fixed hash function at least one bucket receives a 1/n fraction of the universe; collecting keys from that bucket yields a pathological data set on which the table degenerates into a linked list.
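
To make this concrete, here is a toy example of my own (the hash function is deliberately simplistic): for the fixed function h(x) = x mod n, the multiples of n are a pathological data set.

```python
# For the fixed hash function h(x) = x % num_buckets, the keys
# 0, n, 2n, 3n, ... all collide in bucket 0, so the table degenerates
# into a single linked list and lookups become linear time.

num_buckets = 100

def h(x):
    return x % num_buckets

pathological = [i * num_buckets for i in range(10_000)]
assert all(h(x) == 0 for x in pathological)  # everything lands in bucket 0
```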

Notes:
  1. Comparing hash functions with algorithms: we cannot say for sure that a hash function will give good performance no matter what the data is. This is unlike well-known algorithms such as merge sort, whose O(n log n) time complexity holds no matter what data you give it.
  2. Someone may construct a pathological data set that exploits the hash function for some purpose, e.g., a denial-of-service attack.

    Refer to the paper by Crosby and Wallach, "Denial of Service via Algorithmic Complexity Attacks" (USENIX Security 2003), which discusses how people can paralyze several real-world systems (e.g., network intrusion detection) by exploiting badly designed hash functions.
Two characteristics of these breakable systems were:
  1. Open source
  2. Overly simplistic hash function designed for speed (easy to reverse engineer a pathological data set)
Solutions
There are many solutions; here are two:
  1. Use a cryptographic hash function, e.g., SHA-2.
    Cryptographic hash functions have their own pathological data sets, but it is computationally infeasible to figure out what those data sets are. So they work well in practice. (See the first sketch after this list.)
  2. Use randomization.
    Instead of designing a single hash function, we design a family of hash functions and pick one from the family at random at runtime. Compare with quicksort: for any fixed pivot rule there is an input on which quicksort runs in O(n^2), so we pick the pivot randomly. In the same way, you can still make your code open source, yet an attacker will find it hard to tell which hash function a running instance actually uses. (See the second sketch below.)
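
Here is a minimal sketch of solution 1, deriving a bucket index from SHA-256 (a member of the SHA-2 family) via Python's hashlib module; the helper name is my own:

```python
import hashlib

def crypto_bucket(key: str, num_buckets: int) -> int:
    # SHA-256 also has pathological data sets, but finding one is
    # computationally infeasible, so in practice the spread is even.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_buckets

print(crypto_bucket("hello", 1024))
```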
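
And a sketch of solution 2, using the classic universal family h(x) = ((a*x + b) mod p) mod n with a and b drawn at random when the table is created. This is the standard textbook construction; the article itself does not name a particular family.

```python
import random

P = 2_147_483_647  # the Mersenne prime 2^31 - 1; must exceed every key

class RandomizedHash:
    """h(x) = ((a*x + b) % P) % n, with a and b fixed at construction time."""

    def __init__(self, num_buckets):
        self.n = num_buckets
        self.a = random.randrange(1, P)  # drawn once, at table creation
        self.b = random.randrange(0, P)

    def __call__(self, key: int) -> int:
        return ((self.a * key + self.b) % P) % self.n

h = RandomizedHash(num_buckets=1024)
print(h(42))  # varies across runs: an attacker cannot predict (a, b)
```

Because a and b are drawn fresh for every table, the same open-source code yields a different hash function in every running instance, which is exactly what defeats a precomputed pathological data set.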

