Monday, January 18, 2010

Find the median in a continous stream of numbers

Problem
Numbers are randomly generated and passed to a method. Write a program to find and maintain the median value as new values are generated.
OR

You are given a stream of numbers which can be positive or negative. You are required to provide an operation FIND MEDIAN..which when invoked should be able return the median of the numbers in stream(in say O(1) time)

OR 
Find the rolling median?

Example

0 1 2 3 4
Right FIND MEDIAN returns 2 as median

Now input is -2 -4
So Stream = 0 1 2 3 -2 -2 -4
Sorted Stream would be -4 -2 0 1 2 3 4 and FIND MEDIAN returns 1

Solution

Method 1 - Using heap
We maintain two heaps: minHeap and maxHeap. If we represent these two heaps in terms of list, minHeap is in descending order whereas maxHeap is in ascending order. If we manage to maintain these two heaps while getting new numbers, the median is either in the first place of minHeap or in the first place of maxHeap (or the average of these two).

Pseudocode
Step 1: Add next item to one of the heaps

   if next item is smaller than maxHeap root add it to maxHeap,
   else add it to minHeap

Step 2: Balance the heaps (after this step heaps will be either balanced or
   one of them will contain 1 more item)

   if number of elements in one of the heaps is greater than the other by
   more than 1, remove the root element from the one containing more elements and
   add to the other one

if some one calls getMedian...then
   If the heaps contain equal elements;
     median = (root of maxHeap + root of minHeap)/2
   Else
     median = root of the heap with more elements

Consider the example of sequence of numbers: 1, 2, 3, 4 and 5. The maintainence of minHeap, maxHeap are shown below:

After 1 coming:
minHeap: 1
maxHeap:
median : 1

After 2 coming:
minHeap: 1
maxHeap: 2
median : 1.5

After 3 coming:
minHeap: 1
maxHeap: 2 3
median : 2

After 4 coming:
minHeap: 1 2
maxHeap: 3 4
median : 2.5

After 5 coming:
minHeap: 1 2
maxHeap: 3 4 5
median : 3

Java code
public static class maxComparator implements Comparator<Double> {
    public int compare(Double o1, Double o2) {
        return -o1.compareTo(o2);
    }
}
 
static PriorityQueue<Double> minHeap = new PriorityQueue<Double>(11,
        new maxComparator());
static PriorityQueue<Double> maxHeap = new PriorityQueue<Double>();
 
public static void addToHeap(double number) {
    if (minHeap.size() == maxHeap.size()) {
        if (minHeap.isEmpty())
            minHeap.add(number);
        else if (number > minHeap.peek())
            maxHeap.add(number);
        else { // number < minHeap.peek()
            maxHeap.add(minHeap.poll());
            minHeap.add(number);
        }
    } else if (minHeap.size() > maxHeap.size()) {
        if (number > minHeap.peek())
            maxHeap.add(number);
        else { // number <= minHeap.peek()
            maxHeap.add(minHeap.poll());
            minHeap.add(number);
        }
    } else { // minHeap.size() < maxHeap.size()
        if (number < maxHeap.peek())
            minHeap.add(number);
        else { // number >= maxHeap.peek()
            minHeap.add(maxHeap.poll());
            maxHeap.add(number);
        }
    }
}
 
public static double getMedian() {
    if (minHeap.size() == maxHeap.size())
        return (minHeap.peek() + maxHeap.peek()) / 2;
    else if (minHeap.size() > maxHeap.size())
        return minHeap.peek();
    else
        return maxHeap.peek();
}

Method 2 - Using counting sort for integers
We can use counting sort for integers, which will give us A[n/2] as median. But of course this will not work for

Method 3 - Using Balanced Binary Search Trees

At every node of BST, maintain number of elements in the subtree rooted at that node. We can use a node as root of simple binary tree, whose left child is self balancing BST with elements less than root and right child is self balancing BST with elements greater than root. The root element always holds effective median.
If left and right subtrees contain same number of elements, root node holds average of left and right subtree root data. Otherwise, root contains same data as the root of subtree which is having more elements. After processing an incoming element, the left and right subtrees (BST) are differed utmost by 1.
Self balancing BST is costly in managing balancing factor of BST. However, they provide sorted data which we don’t need. We need median only. The next method make use of Heaps to trace median.
  1. Data Structure providing sorted order is a Binary Search Tree. However worst case it can be linear list.
  2. Store the numbers in a BST..if the tree is not fully balanced i.e. it's left or right skewed use tree rotations across the root in O(1) to make the height of left and right subtree same.
  3. Now root node would be at the median. This can be found in O(1) time.

Now the height of left and right subtrees across the root is same and we stop.
Finding height of a binary search tree can be done in O(log h) where h is the height of tree

Method 4 - Using skiplist and queue
See the answer here.

Method 5 - P2 algorithm
See the answer here.

Thanks. Please share if you have some other ways.

References

0 comments:

Post a Comment