Monday, April 26, 2010

External Sorting - How do you sort a million 32-bit integers in 2MB RAM?

A million 32-bit integers is 4MB. We cannot fit them all in the given RAM size and we have to do external sorting. It is an open ended question which is interactive.

Let's assume that -

  • The list of numbers is on disk (as we cannot fit them all in RAM).
  • There is a sort function which sorts the given list effectively i.e. we do not have to worry about a particular sorting algorithm.
  • There is lots of disk space. 
  • As we are short on RAM, the time complexity is not a constraint. This is an important one.
I would:
- divide the list into manageable sizes (in this case, it would be 250,000 numbers of 1MB).
- sort the list, save it on the disk and continue with other list.
- then do an m-way merge on the sorted lists.

m-way (or n-way) merge:

If you have m lists, and every one is sorted, then from the beginning of the lists pick the smallest from one of the list, fetch the next element in that same list. Continue till all elements from all m lists are exhausted. Just like the 2-way merge in merge sort.

This particular m-way merge step is quite inefficient and for m lists containing n/m elements each, it results in O(m*n) time. This is where I realized that time complexity is not important. However, we can maintain a min-heap of the front elements of the m lists and can achieve a time of O(logm) for selecting the smallest - delete the smallest element from the min-heap, insert the next element into the min-heap, each operation takes O(logm) time. So the m-way (n-way) merge for m lists and n elements would take O(logm *n) time. Also, we can read a chunk of size (n/m*m) for each list into memory and so at any time, we have total of n/m elements in memory. 

0 comments:

Post a Comment