Sort the contents of an extremely large (800GB) text file on Windows
One other option is to load the file into a database, e.g. MySQL with MySQL Workbench.
Databases are well suited to working with large files.
If your input file contains just words separated by a new line, this shouldn't be too hard.
After you've installed the database and MySQL Workbench, this is what you'd need to do.
First create the schema (this assumes words won't be longer than 255 characters, although you could alter this by increasing the argument value).
The first column "idwords" is a primary key.
CREATE SCHEMA `tmp` ;
CREATE TABLE `tmp`.`words` (
`idwords` INT NOT NULL AUTO_INCREMENT,
`mywords` VARCHAR(255) NULL,
PRIMARY KEY (`idwords`));
Secondly, import the data. This will import all the words into the table; this step may take a while to complete. My advice would be to run a test with a small words file first, and once you are sure the format is the same as the larger one, truncate the table (i.e. clear it out) and load the full data set.
LOAD DATA LOCAL INFILE "C:\\words.txt" INTO TABLE tmp.words
LINES TERMINATED BY '\r\n'
(mywords);
This link may help with getting the format right for the load.
https://dev.mysql.com/doc/refman/5.7/en/load-data.html
E.g. if you needed to skip the first line, you'd do the following.
LOAD DATA LOCAL INFILE "H:\\words.txt" INTO TABLE tmp.words
-- FIELDS TERMINATED BY ','
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES
(mywords);
Finally, save the sorted file. This may also take a while, depending on your PC.
SELECT tmp.words.mywords
FROM tmp.words
ORDER BY tmp.words.mywords ASC
INTO OUTFILE 'C:\\sorted_words.csv';
You can also query the data as you like. E.g. this will give you the first 50 words in ascending order (starting from offset 0, i.e. the first word).
SELECT tmp.words.mywords
FROM tmp.words
ORDER BY tmp.words.mywords ASC
LIMIT 0, 50 ;
Good luck
Pete
Try Freeware Command Line Sort Utility CMSort.
It uses multiple temporary files and then merges them at the end.
CMSort reads records from the input file until the adjusted memory limit is reached. Then the records are sorted and written to a temporary file. This is repeated until all records are processed. Finally, all temporary files are merged into the output file. If the available memory is sufficient, no temporary files are written and no merging is needed.
One user reports it sorted a file of 130,000,000 bytes.
If you want to tweak some code yourself, there is also Sorting Huge Text Files - CodeProject - "Algorithm of sorting lines in text files size of which exceeds available memory"
sort
There are many algorithms used to sort ordered and unordered files [1].
Since all those algorithms are already implemented, pick a program that has already been tested.
Coreutils (from Linux, but available for Windows too [2]) provides the sort command, which can run in parallel on multi-core processors: usually that is enough.
If your file is really huge, you can help the processing by splitting it into chunks (split -l), sorting each chunk, possibly using the parallel option (--parallel), and then merging the resulting ordered chunks with the -m option (merge sort).
One of the many ways to do it is explained here (split the file, sort the single chunks, merge the ordered chunks, delete the temp files); a sketch follows below.
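As a rough sketch of that workflow (the file names, chunk size and thread count below are placeholders to adjust, and it assumes GNU split and sort are available, e.g. under WSL or a Windows port of coreutils):

# 1. split the huge file into numbered chunks of 10 million lines each
split -l 10000000 -d -a 4 words.txt chunk_

# 2. sort each chunk on its own; LC_ALL=C gives a fast byte-wise ordering
for f in chunk_*; do
    LC_ALL=C sort --parallel=4 -o "$f.sorted" "$f"
done

# 3. merge the already-sorted chunks into the final output (-m, merge only)
LC_ALL=C sort -m chunk_*.sorted -o sorted_words.txt

# 4. delete the temporary chunks
rm chunk_*

Using LC_ALL=C avoids locale-aware comparisons, which is usually much faster; drop it if you need locale ordering.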
Notes:
- In Windows 10 there exists the so-called Windows Subsystem for Linux, in which all the Linux examples will feel more natural.
- Different sorting algorithms have different execution times, which scale as a function of the number of data entries to be sorted (O(n·m), O(n·log n), ...).
- The efficiency of an algorithm also depends on how much order is already present in the original file.
(For example, bubble sort is the fastest algorithm for an already sorted file -- exactly N -- but it is not efficient in other cases.)