extract single file from huge tgz file
Unfortunately, in order to unpack a single member of a .tar.gz archive you have to process the whole archive, and there is not much you can do about it. This is where .zip (and some other formats like .rar) archives work much better: the zip format has a central directory of all the files it contains, with direct offsets pointing into the middle of the zip file, so archive members can be extracted quickly without processing the whole thing.
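As a quick illustration (archive and member names here are hypothetical), a single member can be pulled straight out of a zip file without reading the rest of it:
# unzip consults the central directory and seeks directly to the requested member
unzip archive.zip path/to/file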
You might ask why processing a .tar.gz is so slow. A .tar.gz (often shortened to .tgz) is simply a .tar archive compressed with the gzip compressor. gzip is a streaming compressor that can only work on a single stream: if you want to get at any part of a gzip stream, you have to uncompress it from the beginning, and this is what really kills it for .tar.gz (and for .tar.bz2, .tar.xz and other similar formats based on .tar).
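To make this concrete, even asking tar for a single member of a compressed archive (names here are hypothetical) still forces it to decompress the gzip stream from the beginning rather than seeking to the file:
# tar has no way to jump ahead inside the compressed stream
tar xzf huge-backup.tgz path/to/one/file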
The .tar format itself is actually very, very simple. It is just a stream of 512-byte file or directory headers (name, size, etc.), each followed by the file or directory contents (padded to the 512-byte block size with 0 bytes if necessary). When you hit a completely null 512-byte block where a header should be, that marks the end of the .tar archive.
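If you want to see this layout for yourself, you can dump the first 512-byte header block of an uncompressed archive (the archive name here is hypothetical); the member name occupies the first 100 bytes, and the size field, stored as octal text, starts at offset 124:
# print the first header block of the archive as characters
dd if=backup.tar bs=512 count=1 2>/dev/null | od -c | head -n 12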
Some people think that even .tar archive members cannot be accessed quickly, but this is not quite true. If the .tar archive contains a few big files, you can seek quickly from one header to the next, and thus find the member you need in a few seeks (though in the worst case it can still take as many seeks as there are archive members). If the .tar archive consists of lots of tiny files, quick member retrieval becomes effectively impossible even for an uncompressed .tar.
If you're extracting just one file from a large tar file, are using GNU tar, and can guarantee that the tar file has never been appended to, then you can get a significant performance boost by using --occurrence.
This option tells tar to stop as soon as it finds the first occurrence of each file you've requested, so e.g.
tar xf large-backup.tar --occurrence etc/passwd etc/shadow
will not spool through the whole tarball after it finds one copy each of passwd and shadow; instead it will stop there. If those files appear near the end, the performance gain won't be much, but if they appear even halfway through a 500 GB file you'll save a lot of time.
For people using tar for single-shot backups, rather than real tape drives, this is probably the typical case.
Note that you can also pass --occurrence=NUMBER to retrieve the NUMBERth occurrence of each file, which helps if you know there are multiple versions in the archive. By default the behavior is the same as specifying a NUMBER of 1.
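For example (archive and member names are hypothetical), this pulls the second copy of a file out of an archive that has been appended to:
tar xf large-backup.tar --occurrence=2 etc/passwd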
When dealing with a large tarball, use --fast-read (a bsdtar option) to extract only the first archive entry that matches the filename operand (path/to/file in this case), which is usually unique in a tarball anyway:
tar -xvf file.tgz --fast-read path/to/file
The above will search until it finds a match and then exit.