How to tar.gz many similar-size files into multiple archives with a size limit

It is a quick and rough patchwork of a sketch, but when tested on a directory with 3000 files, the script below did the job extremely fast:

#!/usr/bin/env python3
import subprocess
import os
import sys

splitinto = 2  # number of archives to split the files into

dr = sys.argv[1]
os.chdir(dr)

files = os.listdir(dr)
n_files = len(files)
size = n_files // splitinto

def compress(tar, files):
    # note: --null must come before -T so tar reads NUL-delimited names from stdin
    command = ["tar", "-zcvf", "tarfile" + str(tar) + ".tar.gz", "--null", "-T", "-"]
    proc = subprocess.Popen(command, stdin=subprocess.PIPE)
    with proc:
        proc.stdin.write(b'\0'.join(map(str.encode, files)))
        proc.stdin.write(b'\0')
    if proc.returncode:
        sys.exit(proc.returncode)

sub = []; tar = 1
for f in files:
    sub.append(f)
    if len(sub) == size:
        compress(tar, sub)
        sub = []; tar += 1

if sub:
    # taking care of the leftovers
    compress(tar, sub)

How to use

  • Save it into an empty file as compress_split.py
  • In the head section, set the number of archives to split into (splitinto). In practice, there will usually be one extra archive to take care of the remaining few leftovers.
  • Run it with the directory with your files as argument:

    python3 /path/to/compress_split.py /directory/with/files/tocompress
    

Numbered .tar.gz files will be created in the same directory as the files.

Explanation

The script:

  • lists all files in the directory
  • cd's into the directory to prevent adding the path info to the tar file
  • reads through the file list, grouping the files into the set number of groups
  • compresses the sub group(s) into numbered files
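The grouping step above can be sketched on its own; the helper name below is illustrative and not part of the script:

```python
# Sketch of the grouping step: split a list of names into `splitinto`
# roughly equal groups, the way the script's main loop does.
def group_files(files, splitinto):
    size = len(files) // splitinto  # files per archive
    groups, sub = [], []
    for f in files:
        sub.append(f)
        if len(sub) == size:
            groups.append(sub)
            sub = []
    if sub:  # leftovers go into one extra group
        groups.append(sub)
    return groups
```

With five files and splitinto = 2, this yields two groups of two plus one leftover group, which is why one extra archive usually appears.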

EDIT

Automatically create chunks by size in MB

A more sophisticated approach is to use the maximum size (in MB) of the chunks as a second argument. In the script below, a chunk is written into a compressed file as soon as it reaches (passes) the threshold.

Since a chunk is only written out after it exceeds the threshold, this will only work well if the size of (all) individual files is substantially smaller than the chunk size.
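That greedy accumulation can be sketched in isolation; the file sizes below are hypothetical and the helper name is not part of the script:

```python
# Greedy size-based chunking: a file is added first, then the running
# total is checked, so each chunk may overshoot the threshold by up to
# one file's size.
def chunk_by_size(sizes_mb, chunksize_mb):
    chunks, current, total = [], [], 0.0
    for s in sizes_mb:
        current.append(s)
        total += s
        if total >= chunksize_mb:
            chunks.append(current)
            current, total = [], 0.0
    if current:  # leftovers become one final, smaller chunk
        chunks.append(current)
    return chunks
```

For example, four 4 MB files with a 6 MB threshold give two 8 MB chunks, which shows how far a chunk can overshoot when files are large relative to the threshold.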

The script:

#!/usr/bin/env python3
import subprocess
import os
import sys

dr = sys.argv[1]
chunksize = float(sys.argv[2])
os.chdir(dr)

files = os.listdir(dr)
n_files = len(files)

def compress(tar, files):
    # note: --null must come before -T so tar reads NUL-delimited names from stdin
    command = ["tar", "-zcvf", "tarfile" + str(tar) + ".tar.gz", "--null", "-T", "-"]
    proc = subprocess.Popen(command, stdin=subprocess.PIPE)
    with proc:
        proc.stdin.write(b'\0'.join(map(str.encode, files)))
        proc.stdin.write(b'\0')
    if proc.returncode:
        sys.exit(proc.returncode)

sub = []; tar = 1; subsize = 0
for f in files:
    sub.append(f)
    subsize = subsize + (os.path.getsize(f)/1000000)  # running total in MB
    if subsize >= chunksize:
        compress(tar, sub)
        sub = []; tar += 1; subsize = 0

if sub:
    # taking care of the leftovers
    compress(tar, sub)

To run:

python3 /path/to/compress_split.py /directory/with/files/tocompress chunksize

...where chunksize is the approximate maximum size, in MB, of the input for each tar command.

In this one, the suggested improvements by @DavidFoerster are included. Thanks a lot!


A pure shell approach:

files=(*); 
num=$((${#files[@]}/8));
k=1
for ((i=0; i<${#files[@]}; i+=$num)); do 
    tar cvzf files$k.tgz -- "${files[@]:$i:$num}"
    ((k++))
done

Explanation

  • files=(*) : save the list of files (also directories if any are present, change to files=(*.txt) to get only things with a txt extension) in the array $files.
  • num=$((${#files[@]}/8)); : ${#files[@]} is the number of elements in the array $files. The $(( )) is bash's (limited) way of doing arithmetic. So, this command sets $num to the number of files divided by 8.
  • k=1 : just a counter to name the tarballs.
  • for ((i=0; i<${#files[@]}; i+=$num)); do : iterate over the values of the array. $i is initialized at 0 (the first element of the array) and incremented by $num. This continues until we've gone through all elements (files).
  • tar cvzf files$k.tgz -- "${files[@]:$i:$num}" : in bash, you can get an array slice (part of an array) using ${array[@]:start:length}. So ${array[@]:2:3} will return three elements starting at index 2 (the third element, since bash arrays are zero-indexed). Here, we are taking a slice that starts at the current value of $i and is $num elements long. The -- is needed in case any of your file names start with a -, and the quotes protect names containing spaces.
  • ((k++)) : increment $k
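For comparison, the same fixed-stride slicing is a one-liner in Python; this is a standalone sketch, not part of either script above:

```python
# Split a list into groups of `num` consecutive elements, mirroring the
# bash slice "${files[@]:$i:$num}" taken at strides of $num.
def slice_groups(files, num):
    return [files[i:i + num] for i in range(0, len(files), num)]
```

As with the bash loop, any remainder that does not fill a whole group simply ends up as a shorter final slice.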