Thursday, December 29, 2011

Sorting a very large file which cannot fit in memory

The solution is external sorting.

One approach is an external merge sort: sort chunks of the data that are small enough to fit in memory, write each sorted chunk back to disk, and then merge the sorted chunks into the final output. The limitation is that the data has to be read from and written to disk more than once. A rough sketch of the idea is shown below.
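Here is a minimal two-phase sketch in Java. The file names and the chunk size are illustrative assumptions (they are not from the original text), and a real implementation would add error handling and tuning; the point is the split-sort-merge structure.

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class ExternalMergeSort {

    // Lines held in memory at once -- an assumption for this sketch, tune it to your heap.
    private static final int LINES_PER_CHUNK = 100_000;

    public static void main(String[] args) throws IOException {
        Path input = Paths.get("large_file.txt");    // hypothetical input file
        Path output = Paths.get("sorted_file.txt");  // hypothetical output file
        mergeSortedChunks(splitIntoSortedChunks(input), output);
    }

    // Phase 1: read the input in memory-sized chunks, sort each chunk,
    // and write it to its own temporary file on disk.
    private static List<Path> splitIntoSortedChunks(Path input) throws IOException {
        List<Path> chunkFiles = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> lines = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
                if (lines.size() == LINES_PER_CHUNK) {
                    chunkFiles.add(writeSortedChunk(lines));
                    lines.clear();
                }
            }
            if (!lines.isEmpty()) {
                chunkFiles.add(writeSortedChunk(lines));
            }
        }
        return chunkFiles;
    }

    private static Path writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        Path chunk = Files.createTempFile("sort-chunk-", ".txt");
        Files.write(chunk, lines, StandardCharsets.UTF_8);
        return chunk;
    }

    // Phase 2: k-way merge of the sorted chunks. The priority queue always
    // yields the chunk whose current line is smallest.
    private static void mergeSortedChunks(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<ChunkReader> queue =
                new PriorityQueue<>(Comparator.comparing((ChunkReader c) -> c.current));
        try (BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path chunk : chunks) {
                queue.add(new ChunkReader(chunk));   // every chunk file has at least one line
            }
            while (!queue.isEmpty()) {
                ChunkReader smallest = queue.poll();
                writer.write(smallest.current);
                writer.newLine();
                if (smallest.advance()) {
                    queue.add(smallest);             // chunk still has lines left
                } else {
                    smallest.close();                // chunk exhausted, delete its temp file
                }
            }
        }
    }

    // Wraps one sorted chunk file and exposes its current (smallest unread) line.
    private static final class ChunkReader {
        final Path path;
        final BufferedReader reader;
        String current;

        ChunkReader(Path path) throws IOException {
            this.path = path;
            this.reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
            this.current = reader.readLine();
        }

        boolean advance() throws IOException {
            current = reader.readLine();
            return current != null;
        }

        void close() throws IOException {
            reader.close();
            Files.delete(path);
        }
    }
}

Note the extra disk traffic: every line is written once into a chunk file and read again during the merge, which is exactly the cost mentioned above.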

Hadoop

An external sort is always slower than an in-memory sort, but you are no longer limited by your RAM. If speed is important to you and you have several machines at hand, take a look at Hadoop. It performs an external sort in which the individual sort operations can run on many machines in parallel.
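The text only points at Hadoop, so the following is just a hypothetical sketch of how the idea could be expressed as a MapReduce job (the class names and input/output paths are made up). Each input line is emitted as a map output key, and the framework's shuffle phase sorts the keys before they reach the reducer. With a single reducer the result is one globally sorted file; with several reducers you would also need a total-order partitioner.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HadoopSort {

    // Emit each line as the key; the framework sorts keys during the shuffle.
    public static class LineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(line, NullWritable.get());
        }
    }

    // The keys arrive already sorted; write them back out, keeping duplicates.
    public static class LineReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text line, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            for (NullWritable v : values) {
                context.write(line, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sort large file");
        job.setJarByClass(HadoopSort.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(LineReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(1);   // one reducer => one globally sorted output file
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, a job like this would be launched with the hadoop jar command, passing the input and output paths as arguments.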

Derby

If your restriction is only that you cannot use an external database system, you could try an embedded database (e.g. Apache Derby). That way you get the advantages of a database without any external infrastructure dependencies. A sketch of this approach follows.
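Here is what that could look like with embedded Derby and plain JDBC. The JDBC URL, table definition, batch size, and file names are assumptions for the sketch, and the Derby jar has to be on the classpath; the idea is simply to load the file into a table and let the database sort it with ORDER BY.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class DerbySort {
    public static void main(String[] args) throws Exception {
        // ";create=true" creates the embedded database directory on first use.
        try (Connection conn = DriverManager.getConnection("jdbc:derby:sortdb;create=true")) {
            try (Statement st = conn.createStatement()) {
                st.executeUpdate("CREATE TABLE lines (line VARCHAR(1000))");  // assumes lines <= 1000 chars
            }

            // Load the file into the table; Derby keeps the data on disk, not in RAM.
            conn.setAutoCommit(false);
            try (BufferedReader reader = Files.newBufferedReader(Paths.get("large_file.txt"), StandardCharsets.UTF_8);
                 PreparedStatement insert = conn.prepareStatement("INSERT INTO lines VALUES (?)")) {
                String line;
                int batched = 0;
                while ((line = reader.readLine()) != null) {
                    insert.setString(1, line);
                    insert.addBatch();
                    if (++batched % 10_000 == 0) {
                        insert.executeBatch();   // flush periodically so the batch stays small
                    }
                }
                insert.executeBatch();
                conn.commit();
            }

            // Let the database do the sorting.
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT line FROM lines ORDER BY line");
                 BufferedWriter writer = Files.newBufferedWriter(Paths.get("sorted_file.txt"), StandardCharsets.UTF_8)) {
                while (rs.next()) {
                    writer.write(rs.getString(1));
                    writer.newLine();
                }
            }
        }
    }
}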

Unix sort

sort large_file > sorted_file

Unix sort uses an external R-way merge sorting algorithm: it divides the input into smaller portions that fit into memory, sorts each portion, and then merges the sorted portions together at the end.

The sort command stores its working data in temporary files on disk (usually in /tmp), so it is not limited by available RAM; with GNU sort, the -T option can point those temporary files at a different directory if /tmp is too small.

