Memory problems when compressing and transferring a large number of small files (1TB in total)
Additional information provided in the comments reveals that the OP is using a GUI method to create the .tar.gz file.
GUI software often includes a lot more bloat than the equivalent command line tools, or performs additional unnecessary work for the sake of some "extra" feature such as a progress bar. It wouldn't surprise me if the GUI software is trying to collect a list of all the filenames in memory, which is not necessary in order to create an archive. The dedicated tools tar and gzip are definitely designed to work with streaming input and output, which means that they can deal with input and output a lot bigger than memory.
If you avoid the GUI program, you can most likely generate this archive using a completely normal, everyday tar invocation like this:
tar czf foo.tar.gz foo
where foo is the directory that contains all your 5 million files.
The other answers to this question give you a couple of alternative tar commands to try in case you want to split the result into multiple pieces, etc.
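If you do want to split the archive, one approach that should work is to stream tar's output through split; the 100G chunk size and the foo.tar.gz.part- prefix below are just placeholder choices, and the G suffix assumes GNU split:
tar czf - foo | split -b 100G - foo.tar.gz.part-
The pieces can later be reassembled on the receiving side with cat foo.tar.gz.part-* | tar xzf -.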
"five million" files, and 1TB in total? Your files must be very small, then. I'd simply try rsync
:
rsync -alPEmivvz /source/dir remote.host.tld:/base/dir
If you don't have that, or your use case doesn't allow for using rsync, I'd at least check whether 7z works with your data. It might not, but I think it's still worth a try:
7z a archive.7z /source/dir
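If the result needs to be broken up for transfer, 7z can also write split volumes directly via its -v switch; the 100g volume size here is only an example:
7z a -v100g archive.7z /source/dir
This produces numbered volumes named archive.7z.001, archive.7z.002, and so on.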
Or, if you don't feel comfortable with 7z, at least try making a .tar.xz archive:
tar cJvf archive.tar.xz /source/dir
(It should be noted that older versions of tar create .tar.lzma archives rather than .tar.xz archives when using the J switch, and even older versions of tar don't support the J flag at all.)
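If your tar turns out to be one of those older versions, a workaround that should behave the same way is to pipe a plain tar stream through a standalone xz, assuming xz itself is installed:
tar cf - /source/dir | xz > archive.tar.xz
Recent xz releases can also parallelise the compression with -T0, if yours supports it.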
Since you're using a GUI program to create those files, I'm assuming you're feeling a bit uncomfortable using a command line interface.
To facilitate creation, management and extraction of archives from the command line, there's a small utility called atool. It is available for practically every common distro I've seen, and works with pretty much every archive format I've stumbled upon, except for the hopelessly obscure ones.
Check whether your distro has atool in its repos, or ask your admin to install it if you're in a workplace environment.
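For example, on Debian or Ubuntu based systems the package is simply called atool, so installation would typically be a one-liner; other distros' package managers will differ:
sudo apt-get install atool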
atool installs a bunch of symlinks to itself, so packing and unpacking becomes a breeze:
apack archive.tar.xz <files and/or directories>
Creates an archive.
aunpack archive.7z
Expands the archive.
als archive.rar
Lists file contents.
atool determines what kind of archive to create from the filename extension of the archive you name on the command line.
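So, assuming your files live in a directory called /source/dir, the very same apack command can produce whichever format you ask for just by changing the extension; these two invocations are purely illustrative:
apack archive.tar.gz /source/dir
apack archive.7z /source/dir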
Unless you can do better than 25:1 compression, you are unlikely to gain anything from compressing this before snail-mailing it, unless you have some hardware tape format that you can exchange with the third party.
The largest common optical storage is Blu-ray, and that will get you roughly 40 GB. You would need 25:1 compression on your data to get it to fit on that. If your third party only has DVD, you need 125:1 (roughly).
If you cannot match those compression numbers, just use a normal disk, copy the data onto it, and snail-mail that to the third party. In that case, shipping anything smaller than a 1 TB drive that would need compression is madness.
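Getting the tree onto such a drive can be done with a plain cp -a, or with rsync so that an interrupted copy can be resumed; the mount point below is just a placeholder for wherever the drive is mounted:
rsync -a /source/dir/ /mnt/external-drive/dir/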
You just have to compare that to using ssh -C (standard compression) or, preferably, rsync with compression to copy the files over the network; there's no need to compress and tar up front. 1 TB is not impossible to move over the net, but it is going to take a while.
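As a sketch, such a resumable, compressed network copy might look like the following, where the host and paths are placeholders and -z only pays off if the files actually compress well:
rsync -az --partial --progress /source/dir/ remote.host.tld:/dest/dir/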