Sort with the limited memory

Let us assume you have a huge file H and limit memory M.
I have two solutions.

Solution 1:
step 1: Read M from the file and write it into a temp buffer.
step 2: Sort (you should use in-place sorting algorithm, like QuickSort,HeapSort) the values.
step 3: Create a temp file and write the buffer into the temp file. Save the name of the temp file.
step 4: Repeat step 1 to 3 until read file H out, and sort M out and save all temp files.
step 5: Merge all temp files to a new file. Create a new file,and open all the temp files,put the file handles into a array.
step 6: Find the minimum number from the set of numbers currently pointed to by the file read pointer. You can use normal way to do that, compare every number, and use a temp variable to remember it (time complexity is O(n). Or you can create a priorityQueue,and a map, the key of the map is the number, and the value of the map is the sequence of the file handles.(time complexity is O(lgn),the second way wastes more memory, but performance is better,if you want a better way, you can use a int to replace list using bitmap, becase the temp file names are sequeced.
step 7: Save the number to the new file.
step 8: Read another number from the file that contained the minimum number at step 6.
step 9: Repeat step 7 to 8 until all numbers from all the temp files have been processed.

Solution 2: step 1 : Create N temp files, the range of the numbers of every temp files are different.(For example,the range of temp_file_1 is from 0 to 1000000, and the next temp file is from 1000001 to 2000000...)
step 2: Read m from the H file and write the number into the different temp files until nothing can read from file H.
step 3: Sort every temp files.
step 4: Create a new file, merge all temp files into this new file one by one.

What's the difference between the solutions. According to solution 1, every number is sorted once and is compared many times(step 5), but read and write only twice. about solution 2, no merge processing, but every single number is read and wrote three times.


You are looking for external sorting. The largest cost in these scenarios is often disk IO. So the trick is to use an algorithm that minimises disk IO. The usual approach is to read suitably large chunks in to memory, sort those chunks, saving them back to disk, then merge the sorted chunks.

A search for "external sorting" or "sort merge" and your technologies of choice should give some good results.

Tags:

Sorting

Memory