unix - split a huge .gz file by line
Pipe the decompressed stream to split. Use either gunzip -c or zcat to open the file:
gunzip -c bigfile.gz | split -l 400000
Add an output prefix to the split command to control the part names; without one, split writes xaa, xab, and so on into the current directory.
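As a minimal sketch of what such output specifications could look like, the snippet below runs the same pipeline on a tiny sample. The 10-line sample file, the scratch directory, the 4-line chunk size, and the part_ prefix are all assumptions made for illustration.

```shell
# Hypothetical sample data: a 10-line gzipped file in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | gzip > "$workdir/bigfile.gz"

# "-" tells split to read standard input; the last argument is the output prefix.
# -a 3 asks for three-letter suffixes: part_aaa, part_aab, part_aac, ...
gunzip -c "$workdir/bigfile.gz" | split -l 4 -a 3 - "$workdir/part_"

# List the parts that were created.
ls "$workdir"/part_*
```

With 10 lines and 4 lines per part, this should leave three files, the last one holding the 2 leftover lines. If your split is the GNU coreutils one, its --filter option can recompress each part as it is written (for example --filter='gzip > $FILE.gz'); that is a GNU extension and may not be available elsewhere.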
How to do this best depends on what you want:
- Do you want to extract a single part of the large file?
- Or do you want to create all the parts in one go?
If you want a single part of the file, your idea to use gunzip and head is right. You can use:
gunzip -c hugefile.txt.gz | head -n 4000000
That would output the first 4000000 lines to standard output; you probably want to append another pipe or a redirect to actually do something with the data.
To get the other parts, you'd use a combination of head and tail, like:
gunzip -c hugefile.txt.gz | head -n 8000000 | tail -n 4000000
to get the second block.
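Here is a small sketch of the two pipelines above with the output actually sent somewhere, in this case recompressed into one file per block. The sample file, the scratch directory, the 4-line block size, and the block1/block2 file names are assumptions for illustration only.

```shell
# Hypothetical sample: 10 lines, treated as blocks of 4 lines each.
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | gzip > "$workdir/hugefile.txt.gz"

# First block: lines 1-4, recompressed on the fly.
gunzip -c "$workdir/hugefile.txt.gz" | head -n 4 | gzip > "$workdir/block1.txt.gz"

# Second block: keep the first 8 lines, then the last 4 of those (lines 5-8).
gunzip -c "$workdir/hugefile.txt.gz" | head -n 8 | tail -n 4 | gzip > "$workdir/block2.txt.gz"

# Show the second block.
gunzip -c "$workdir/block2.txt.gz"
```

One thing to watch with head followed by tail -n: if the final block is shorter than the block size, tail -n still returns a full block's worth of lines, so the last part overlaps the previous one. The tail -n +N form does not have that problem.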
Would doing a series of these perhaps be a solution, or would gunzip -c require enough space for the entire file to be unzipped?
No, gunzip -c does not require any disk space. It decompresses the data a chunk at a time in memory and streams the result to stdout, so the uncompressed file is never stored in full.
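Because nothing is written to disk, the same streaming approach can be used to find out how many blocks there will be before extracting any of them. This is only a sketch; the sample file, the scratch directory, and the block_size value are assumptions.

```shell
# Hypothetical sample: a 10-line gzipped file.
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | gzip > "$workdir/hugefile.txt.gz"

block_size=4

# Count lines by streaming through wc; no uncompressed copy is written.
# tr strips the padding that some wc implementations add.
total=$(gunzip -c "$workdir/hugefile.txt.gz" | wc -l | tr -d ' ')

# Ceiling division: a partial final block still counts as a block.
blocks=$(( (total + block_size - 1) / block_size ))

echo "$total lines, $blocks blocks"
```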
If you want to create all the parts in one go, it is more efficient to create them all with a single command, because then the input file is only read once. One good solution is to use split; see jim mcnamara's answer for details.
As you are working on a (non-rewindable) stream, you will want to use the '+N' form of tail to get lines starting from line N onwards.
zcat hugefile.txt.gz | head -n 40000000
zcat hugefile.txt.gz | tail -n +40000001 | head -n 40000000
zcat hugefile.txt.gz | tail -n +80000001 | head -n 40000000
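The pattern in those three commands can be wrapped in a small helper that computes the starting line for any block. This is a sketch: the extract_block name, its arguments, and the sample data are made up for illustration, and gunzip -c is used in place of zcat because zcat on some systems only handles .Z files.

```shell
# extract_block FILE K SIZE: print the K-th block (1-based) of SIZE lines.
# Illustrative helper, not part of any standard tool.
extract_block() {
    # First line of block K is (K - 1) * SIZE + 1.
    start=$(( ($2 - 1) * $3 + 1 ))
    gunzip -c "$1" | tail -n "+$start" | head -n "$3"
}

# Hypothetical sample: 10 lines, blocks of 4.
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 | gzip > "$workdir/hugefile.txt.gz"

extract_block "$workdir/hugefile.txt.gz" 2 4   # lines 5-8
extract_block "$workdir/hugefile.txt.gz" 3 4   # lines 9-10, a short final block
```

Each call rereads the compressed file from the start, so this suits pulling out one or two blocks; for all blocks at once, a single pass through split reads the input only once.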