Friday, March 28, 2014

Sorting a 2GB file with one string per line

If you have a 2GB file with one string per line, which sorting algorithm would you use to sort the file and why?
When an interviewer gives a size limit of 2GB, it should tell you something – in this case, it suggests that they don’t want you to bring all the data into memory.
So what do we do? We only bring part of the data into memory at a time.

Method 1 - K way external merge sort
Because 2GB of strings is far too large to fit into main memory, I came up with two ways:
  1. K-way merge sort. Divide the file into K pieces, sort each piece in main memory, then merge the sorted pieces.
  2. Bucket sort. Distribute the strings into buckets by their leading characters, then sort each bucket.
The first approach works as follows:
  1. Divide the file into K chunks of size X, where X * K = 2 GB. Bring the first chunk into memory, sort its lines as usual using any O(n log n) algorithm, and write the sorted chunk back to disk.
  2. Bring the next chunk into memory, sort it, and write it back. Repeat until all K chunks are sorted.
  3. Once we're done, merge the sorted chunks. This is called a K-way merge.
The above algorithm is also known as external merge sort. How the last step is performed depends on the size of the main memory available: we can either merge 2 chunks at a time or do a proper K-way merge.
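The sort phase above can be sketched in Python. This is a minimal illustration, not a production implementation; the function name `sort_chunks` and the use of temp files for the sorted runs are my own choices:

```python
import os
import tempfile

def sort_chunks(input_path, chunk_bytes):
    """Split the input file into sorted runs, each small enough to sort in memory."""
    run_paths = []
    with open(input_path) as f:
        while True:
            # readlines(hint) reads whole lines until roughly chunk_bytes is consumed
            lines = f.readlines(chunk_bytes)
            if not lines:
                break
            lines.sort()  # any O(n log n) in-memory sort
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(lines)
            run_paths.append(path)
    return run_paths
```

Each returned file is one sorted chunk, ready for the merge phase.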

Example - Merge 2 (or 3) at a time
Suppose we have 3 sets of sorted data - A, B, C, so K = 3.
First merge A and B together, giving X = A + B. Then merge X with C, giving Y = X + C.
This method merges only 2 runs at a time and therefore requires less memory.
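As a sketch, here is the classic two-way merge applied pairwise as described (the function name `merge_two` and the sample strings are illustrative):

```python
def merge_two(a, b):
    """Merge two sorted lists into one sorted list."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # append whatever remains of either list
    out.extend(b[j:])
    return out

A, B, C = ["ant", "fox"], ["bee", "owl"], ["cat", "yak"]
X = merge_two(A, B)   # X = A + B
Y = merge_two(X, C)   # Y = X + C -> ['ant', 'bee', 'cat', 'fox', 'owl', 'yak']
```

At any moment only two runs are open, which is what keeps the memory requirement low.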

Example - Proper K way merge
In a proper K-way merge, you select the next element from any of the K groups: at each step, check the lowest remaining value in every list to see which one comes next. A single pass produces Y = A + B + C.
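In Python this single-pass K-way merge is available out of the box as `heapq.merge`, which keeps one "current" element per input in a min-heap and always yields the smallest next value across all groups (the sample data here is illustrative):

```python
import heapq

A = ["ant", "fox"]
B = ["bee", "owl"]
C = ["cat", "yak"]

# One pass over all K inputs; only one element per input is held at a time.
Y = list(heapq.merge(A, B, C))  # Y = A + B + C, fully sorted
```

Because `heapq.merge` consumes its inputs lazily, the same call works on file iterators, which is exactly what the merge phase of an external sort needs.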

Which of these you choose depends on the available memory and the element size.

For example, if you have 100M of memory available and the element size is 100K, you can use the latter. That's because, for a 2G file, you need 20 chunks (of 100M each) in the sort phase, which means a proper 20-way merge holds 100K × 20, or about 2M, well under your memory limit.
Alternatively, let's say you only have 1M available. That gives about 2000 (2G / 1M) chunks, and multiplying that by 100K gives 200M, well beyond your capacity.

So you would have to do that merge in multiple passes. Keep in mind, though, that multiple passes don't have to mean merging just two lists at a time.
You could find a middle ground where, for example, each pass merges ten lists. Ten elements of 100K each is only 1M, so it fits within your memory constraint while requiring fewer merge passes.
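A multi-pass merge with a configurable fan-in can be sketched as follows. This is a simplified in-memory illustration (the name `multipass_merge` and the default fan-in of 10 are my own choices; a real external sort would stream each group from and to disk):

```python
import heapq

def multipass_merge(runs, fan_in=10):
    """Repeatedly merge up to fan_in sorted runs at a time until one run remains."""
    while len(runs) > 1:
        next_runs = []
        for i in range(0, len(runs), fan_in):
            group = runs[i:i + fan_in]           # at most fan_in runs per merge
            next_runs.append(list(heapq.merge(*group)))
        runs = next_runs
    return runs[0] if runs else []
```

With 2000 runs and a fan-in of 10, the first pass leaves 200 runs, the second 20, and the third 2, so four passes finish the job instead of the roughly eleven needed with two-way merges.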

Method 2 - Caching
Here what we can do is again divide the 2G file into chunks X1, X2, X3, and so on, each of which fits in X memory. Now we load chunk X1, iterate over chunk X2, merge the two, and keep only the lowest elements in memory. So we have Y = X1 + X2, but only about half of Y is kept in memory at any time, and likewise we continue with the remaining chunks. Of course, this method can be improved upon to make its behaviour more deterministic.
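One reading of this idea is a streaming merge that holds only the current lowest line from each source, emitting results as it goes; this is my interpretation of the description above, not a definitive implementation (the generator name `stream_merge` is illustrative):

```python
def stream_merge(f1, f2):
    """Merge two sorted line streams, holding only one line from each in memory."""
    a, b = f1.readline(), f2.readline()
    while a and b:
        if a <= b:
            yield a            # emit the lower of the two current lines
            a = f1.readline()  # then advance only that stream
        else:
            yield b
            b = f2.readline()
    while a:                   # drain whichever stream is left
        yield a
        a = f1.readline()
    while b:
        yield b
        b = f2.readline()
```

Only two lines live in memory at once, regardless of file size; the output could be written straight to disk or fed into another merge.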

