Efficiently remove file(s) from large .tgz
With GNU tar, you can do:
pigz -d < file.tgz |
tar --delete --wildcards -f - '*/prefix*.jpg' |
pigz > newfile.tgz
With bsdtar:
pigz -d < file.tgz |
bsdtar -cf - --exclude='*/prefix*.jpg' @- |
pigz > newfile.tgz
(pigz being the multi-threaded version of gzip).
You could also overwrite the file in place, like this:
{ pigz -d < file.tgz |
tar --delete --wildcards -f - '*/prefix*.jpg' |
pigz &&
perl -e 'truncate STDOUT, tell STDOUT'
} 1<> file.tgz
But that's quite risky, especially if the result ends up being less compressed than the original file (in which case the second pigz may end up overwriting areas of the file which the first one has not read yet).
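A less risky way to get the same end result is to write the filtered archive to a temporary file in the same directory and rename it over the original only once it is complete. Here is a minimal Python sketch of that idea, assuming there is enough disk space for a second copy. The function name `filter_tgz_in_place` is hypothetical, and the pattern follows Python's `fnmatch` rules rather than tar's own wildcard matching:

```python
import fnmatch
import os
import shutil
import tarfile
import tempfile


def filter_tgz_in_place(path, pattern):
    """Rewrite the gzip'ed tar at `path` without members matching `pattern`.

    Hypothetical helper: the result goes to a temporary file next to the
    original, which is renamed over it only after it is fully written.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp, \
                tarfile.open(path, mode="r|*") as source, \
                tarfile.open(fileobj=tmp, mode="w|gz") as dest:
            for member in source:
                if fnmatch.fnmatchcase(member.name, pattern):
                    continue  # drop this member
                # Only regular files carry data that needs copying.
                data = source.extractfile(member) if member.isreg() else None
                dest.addfile(member, data)
        shutil.copymode(path, tmp_path)  # keep the original permission bits
        os.replace(tmp_path, path)       # atomic on the same filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```

For example, `filter_tgz_in_place('file.tgz', '*/prefix*.jpg')` would leave `file.tgz` untouched if anything fails midway. Note that this compresses in a single thread, so it will be slower than the pigz pipelines.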
Don't discount the easy way: it may be fast enough for your purpose. With avfs to access the archive as a directory:
cd ~/.avfs/path/to/original.tar.gz\#
pax -w -s '/^.*\.jpg$//' . | gzip >/path/to/filtered.tar.gz # POSIX
tar -czf /path/to/filtered.tar.gz -s '/^.*\.jpg$//' . # BSD
tar -czf /path/to/filtered.tar.gz --exclude='*.jpg' . # GNU
With more primitive tools, first extract the files excluding the .jpg files, then create a new archive.
mkdir tmpdir && cd tmpdir
<../original.tar.gz gzip -d | pax -r -pe -s '/^.*\.jpg$//'
pax -w . | gzip >../filtered.tar.gz
cd .. && rm -rf tmpdir
If your tar has --exclude:
mkdir tmpdir && cd tmpdir
tar -xzf ../original.tar.gz --exclude='*.jpg'
tar -czf ../filtered.tar.gz .
cd .. && rm -rf tmpdir
However, this may mangle file ownership and modes if you don't run it as root. For best results, use a temporary directory on a fast filesystem (tmpfs, if you have one that's large enough).
Support for archivers acting as a pass-through (i.e. reading an archive and writing an archive) tends to be limited. GNU tar can delete members from an archive with the --delete operation (“The --delete option has been reported to work properly when tar acts as a filter from stdin to stdout.”), and that's probably your best option.
You can make powerful archive filters in a few lines of Python. Its tarfile library can read and write from non-seekable streams, and you can use arbitrary code in Python to filter, rename, modify…
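#!/usr/bin/env python3
import re, sys, tarfile

# Stream modes ('r|*', 'w|gz') work on pipes; use the binary stdin/stdout.
source = tarfile.open(fileobj=sys.stdin.buffer, mode='r|*')
dest = tarfile.open(fileobj=sys.stdout.buffer, mode='w|gz')
for member in source:
    # Copy everything except regular files whose name ends in .jpg.
    if not (member.isreg() and re.match(r'.*\.jpg\Z', member.name)):
        sys.stderr.write(member.name + '\n')
        # Only regular files carry data; extractfile() fails on links in stream mode.
        data = source.extractfile(member) if member.isreg() else None
        dest.addfile(member, data)
dest.close()
source.close()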
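The same kind of loop can also rename members on the way through, since the stored name is just an attribute of each `TarInfo` object. Here is a small sketch, assuming Python 3. The function `copy_filtered` and its `skip` and `rename` callbacks are hypothetical names introduced for illustration, not part of the tarfile API:

```python
import tarfile


def copy_filtered(src_fileobj, dst_fileobj, skip, rename=lambda name: name):
    """Stream a tar archive from src to dst, dropping and renaming members.

    `skip(member)` returns True for members to drop; `rename(name)` returns
    the name to store in the output. Both callbacks are illustrative.
    """
    with tarfile.open(fileobj=src_fileobj, mode="r|*") as source, \
            tarfile.open(fileobj=dst_fileobj, mode="w|gz") as dest:
        for member in source:
            if skip(member):
                continue
            # Only regular files carry data that needs copying.
            data = source.extractfile(member) if member.isreg() else None
            member.name = rename(member.name)
            # A stale pax "path" record would take priority over the new name
            # when writing in pax format, so drop it if present.
            member.pax_headers.pop("path", None)
            dest.addfile(member, data)
```

Used as a filter, this might be called as `copy_filtered(sys.stdin.buffer, sys.stdout.buffer, skip=lambda m: m.isreg() and m.name.endswith('.jpg'))`, optionally with something like `rename=lambda n: n.replace('old/', 'new/', 1)`.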
With the tar that comes with Mac OS X, you could do this:
tar -czf b.tgz --exclude '*.jpg' @a.tgz
mv b.tgz a.tgz
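Before moving the new archive over the original, it may be worth checking that it can be read and no longer contains the excluded files. Here is a small Python sketch. `matching_members` is a hypothetical helper name, and the pattern follows `fnmatch` rules rather than tar's own:

```python
import fnmatch
import tarfile


def matching_members(path, pattern):
    """Return the names of members in the archive at `path` matching `pattern`."""
    with tarfile.open(path, mode="r:*") as archive:
        return [name for name in archive.getnames()
                if fnmatch.fnmatchcase(name, pattern)]
```

Here `matching_members('b.tgz', '*.jpg')` should come back as an empty list before `mv b.tgz a.tgz` is run.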