Wednesday, July 23, 2014

Find largest 1M numbers in 1B numbers

Problem

Describe an algorithm to find the largest 1 million numbers in 1 billion numbers. Assume that the computer memory can hold all one billion numbers.

Solutions

Method 1 - Sorting
We can certainly sort those numbers and return the last 1 million of them. That takes O(n log n).

Method 2 - Sort partially using some pivot
If we think about it, we actually do not need to sort them. After all, we just need the largest 1 million numbers, in whatever orders. Therefore we can sort of “partially” sort the numbers and try to find the parts that we need. To do this, we get inspiration from Quicksort, where we get two partitions around pivot during each run.

For simplicity, Let us denote the number of elements in the right part as m.
If m is exactly equal to 1 million, we have found the largest 1 million numbers!
If m is larger than 1 million, we need to keep looking for the 1 million numbers, but in the left part of the array this time. We can do this in a recursive way.
If m is less than 1 million, we first remember those m numbers and then we search for the largest (1 million - m) numbers in the left part of the array. Again, recursion here.
Using the random pivot choosing approach, this algorithm takes O(n) time complexity and it does not need any additional space since we can do the partition in place.

Java code
public static int quickPartition(int[] array, int low, int high) {
    if (high <= low)
        return -1;
 
    // randomly choose a pivot
    int index = new Random().nextInt(high - low + 1) + low;
    int pivot = array[index];
 
    // swap the pivot to the front
    int num = array[index];
    array[index] = array[low];
    array[low] = num;
 
    // partition as in Quicksort
    int L = low + 1;
    int R = high;
 
    while (R >= L) {
        while (R >= L && array[L] < pivot)
            L++;
        while (R >= L && array[R] >= pivot)
            R--;
        if (R >= L) {
            int temp = array[L];
            array[L] = array[R];
            array[R] = temp;
            L++;
            R--;
        }
    }
 
    // swap the pivot back to appropriate position.
    num = array[R];
    array[R] = array[low];
    array[low] = num;
 
    return R;
}
 
public static List<Integer> largestM(int[] numbers, int M, int low, int high) {
    // RIndex = index of pivot
    int RIndex = quickPartition(numbers, low, high);
    // mFound = # of the largest numbers
    int mFound = high + 1 - RIndex;
 
    List<Integer> results = new LinkedList<Integer>();
 
    if (mFound == M) { // Done!
        int i = RIndex;
        for (; i <= high; ++i)
            <span class="skimlinks-unlinked">results.add(numbers</span>[i]);
        return results;
 
    } else if (mFound > M) { // Keep looking
        // check if those mFound elements are actually the same
        boolean duplicates = true;
        for (int i = RIndex; i <= high; ++i) {
            if (numbers[i] != numbers[RIndex])
                duplicates = false;
        }
 
        if (!duplicates)
            return largestM(numbers, M, RIndex, high);
        else {
            // if they are the same, just return M duplicates of them
            for (int k = 0; k < M; ++k)
                <span class="skimlinks-unlinked">results.add(numbers</span>[RIndex]);
            return results;
        }
    } else { // Has found some, keep looking for the rest
        int i = RIndex;
        for (; i < numbers.length; ++i)
            <span class="skimlinks-unlinked">results.add(numbers</span>[i]);
 
        results.addAll(largestM(numbers, M - mFound, low, RIndex - 1));
        return results;
    }
 
}

Method 3 - Using heap
Run them all through a max-heap of size 100: for each input number k, replace the current min m with max(k, m). Afterwards the heap holds the 100 largest inputs.
A search engine like Lucene can use this method, with refinements, to choose the most-relevant search answers.
1) Build a Max Heap tree in O(n)
2) Use Extract Max k times to get k maximum elements from the Max Heap O(klogn)
Time complexity: O(n + klogn)

Thanks
References

0 comments:

Post a Comment