What is the difference between different "compression" systems?
tar stands for tape archive. All it does is pack files and their metadata (permissions, ownership, etc.) into a stream of bytes that can be stored on a tape drive (or in a file) and restored later. Compression is an entirely separate matter: you used to have to pipe tar's output through an external utility if you wanted the archive compressed. GNU tar was nice enough to add switches that filter the output through the appropriate utility automatically, as a shortcut.
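To make that concrete, here is a minimal sketch of both approaches (backup.tar, backup.tar.gz and ~/Documents are just placeholder names):
$ tar -cf backup.tar ~/Documents       # archive only: files + metadata, no compression
$ gzip backup.tar                      # compress separately, producing backup.tar.gz
$ tar -czf backup.tar.gz ~/Documents   # GNU tar shortcut: -z pipes the stream through gzip for you
$ tar -cJf backup.tar.xz ~/Documents   # likewise, -J pipes it through xz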
Zip and 7z combine the archiving and the compression into their own container format, and they were designed to pack files on a DOS/Windows system, so they do not store Unix permissions and ownership. Thus if you want to store permissions for proper backups, you need to stick with tar. If you plan on exchanging files with Windows users, then zip or 7z is fine. The actual compression algorithms zip and 7z use can be used with tar, by using gzip and lzma respectively.
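For example (photos/ is a placeholder directory; only the tar variants keep Unix ownership and permission bits):
$ zip -r photos.zip photos/        # fine for sending to Windows users
$ tar -czf photos.tar.gz photos/   # same DEFLATE family of compression as zip, via gzip
$ tar -cJf photos.tar.xz photos/   # same LZMA family of compression as 7z, via xz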
lzma (a.k.a. *.xz) has one of the best compression ratios and is quite fast at decompression, making it a top choice these days. It does, however, require a ton of RAM and CPU time to compress. The venerable gzip is quite a bit faster at compression, so it may be used if you don't want to dedicate that much CPU time; it also has an even faster variant called lzop. bzip2 is still fairly popular, since it largely replaced gzip for a time before 7zip/lzma came about thanks to its better compression ratios, but it is falling out of favor these days because 7z/lzma decompresses faster and gets better compression ratios. The compress utility, which normally names files *.Z, is ancient and long forgotten.
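If you want to see the trade-offs on your own data, a rough comparison like this works (data/ is a placeholder; -k keeps the input file and needs a reasonably recent gzip):
$ tar -cf data.tar data/
$ time gzip  -k data.tar    # fastest of the three, moderate ratio -> data.tar.gz
$ time bzip2 -k data.tar    # slower, better ratio                 -> data.tar.bz2
$ time xz    -k data.tar    # slowest, usually best ratio          -> data.tar.xz
$ ls -l data.tar*           # compare the resulting sizes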
One of the other important differences between zip and tar is that zip compresses the data in small chunks, whereas when you compress a tar file, you compress the whole thing at once. The latter gives better compression ratios, but in order to extract a single file at the end of the archive, you must decompress the whole thing to get to it. Thus the zip format is better at extracting a single file or two from a large archive. 7z and dar allow you to choose to compress the whole thing (called "solid" mode) or small chunks for easy piecemeal extraction.
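To illustrate (big.tar.gz, big.zip and the member path are placeholders):
$ tar -xzf big.tar.gz docs/report.txt   # has to decompress everything up to docs/report.txt
$ unzip big.zip docs/report.txt         # seeks to that member and inflates only it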
The details of the algorithms are off topic here¹ since they are not in any way specific to Linux, let alone Ubuntu. You will, however, find some nice info here.
Now on to tar. As you said, tar is not and never has been a compression program. Instead, it is an archiver; its primary purpose is to make one big file out of a lot of small ones. Historically this was to facilitate storing data on tape drives, hence the name: Tape ARchive.
Today, the primary reason to use tar is to decrease the number of files on your system. Each file on a Unix file system takes up an inode; the more files you have, the fewer inodes are available, and when you run out of inodes, you can no longer create new files. To put it simply, the same amount of data stored as thousands of files will take up more of your hard drive than those same files in a single tar archive.
To illustrate, since this has been contested in the comments, on my 68G / partition I have the following number of total and used inodes (bear in mind that inode count depends on the file system type and the size of the partition):
Inode count: 393216
Free inodes: 171421
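(You can check the equivalent figures on your own system with something like the following; /dev/sda1 and an ext-family filesystem are assumptions here:)
$ df -i /                                     # inode totals, used and free for /
$ sudo tune2fs -l /dev/sda1 | grep -i inode   # per-filesystem inode details on ext2/3/4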
If I now proceed to attempt to create more files than I have inodes:
$ touch {1..171422}
touch: cannot touch ‘171388’: No space left on device
touch: cannot touch ‘171389’: No space left on device
touch: cannot touch ‘171390’: No space left on device
touch: cannot touch ‘171391’: No space left on device
touch: cannot touch ‘171392’: No space left on device
touch: cannot touch ‘171393’: No space left on device
touch: cannot touch ‘171394’: No space left on device
touch: cannot touch ‘171395’: No space left on device
touch: cannot touch ‘171396’: No space left on device
touch: cannot touch ‘171397’: No space left on device
No space? But I have loads of space:
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 5,8G 4,3G 1,2G 79% /
As you can see above, creating a few hundred thousand empty files rapidly depletes my inodes and I can no longer create new ones. If I were to tar these, I would be able to start creating files again.
Having fewer files also greatly speeds up file system I/O, especially on NFS-mounted filesystems. I always tar my old work directories when a project is finished, since the fewer files I have, the faster programs like find will work.
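As a sketch of that habit (project/ is a placeholder, and you should only delete the original after checking the archive):
$ tar -czf project.tar.gz project/                         # one file instead of thousands
$ tar -tzf project.tar.gz > /dev/null && rm -r project/    # quick sanity check, then reclaim the inodes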
There is a great answer on Super User that goes into far more detail, but in addition to the above, the other basic reasons why tar is still popular today are:
- Efficiency: using tar to pipe through a compression program like gzip is more efficient, since it avoids the creation of intermediate files.
- tar comes with all sorts of bells and whistles, features that have been designed over its long history that make it particularly useful for *nix backups (think permissions, file ownership, the ability to pipe data straight to STDOUT and over an SSH link, as sketched after this list...).
- Inertia. We're used to tar. It's safe to assume it will be available on any *nix you might happen to use, which makes it very portable and handy for source code tarballs.
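For instance, piping a backup straight over an SSH link looks roughly like this (backuphost and the paths are placeholders):
$ tar -czf - /home/alice | ssh backuphost 'cat > alice-home.tar.gz'      # stream a backup to another machine, no temp file
$ ssh backuphost 'cat alice-home.tar.gz' | tar -xzf - -C /srv/restore    # and pull it back the same way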
¹ This is absolutely true and has nothing to do with the fact that I don't know enough about them to explain :)
There are two distinct but related tasks. Packing a tree of files (including filenames, directory structure, filesystem permissions, ownership and any other metadata) into a byte stream is called archiving. Removing redundancy in a byte stream to produce a smaller byte stream is called compression.
On Unix, the two operations are separated, with distinct tools for each. On most other platforms (current and historical) combined tools perform both archiving and compression.
(gzip and other programs that mimic gzip's interface often have the option to store the original filename in the compressed output, but this, along with a CRC or other check to detect corruption, is the only metadata they can store.)
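You can see exactly what gzip stored with its listing option (notes.txt.gz is a placeholder name):
$ gzip -lv notes.txt.gz    # the verbose listing includes the CRC, sizes and the stored original name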
There are advantages to separating compression from archiving. Archiving is platform-specific (the filesystem metadata needing preserving varies widely), but the implementation is straightforward, largely I/O-bound, and changes little over time. Compression is platform-independent, but implementations are CPU-bound and algorithms are constantly improving to take advantage of the increased resources that modern hardware can bring to bear on the problem.
The most popular Unix archiver is tar, although there exist others such as cpio and ar. (Debian packages are ar archives, while cpio is often used for initial ramdisks.) tar is, or has often been, combined with compression tools such as compress (.Z), gzip (.gz), bzip2 (.bz2) and xz (.xz), from oldest to youngest, and not coincidentally from worst to best compression.
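A couple of illustrations of the above (some-package.deb and src/ are placeholders, and the ancient compress program usually has to be installed separately these days):
$ ar t some-package.deb       # a .deb really is an ar archive; expect debian-binary, control.tar.* and data.tar.*
$ tar -cZf src.tar.Z   src/   # tar + compress
$ tar -cjf src.tar.bz2 src/   # tar + bzip2
$ tar -cJf src.tar.xz  src/   # tar + xz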
Making a tar archive and compressing it are distinct steps: the compressor knows nothing about the tar file format. This means that extracting a single file from a compressed tar archive requires decompressing all of the preceding files. This is often called a "solid" archive.
Equally, since tar is a "streaming" format (required for it to be useful in a pipeline), there is no global index in a tar archive, and listing the contents of a tar archive is just as expensive as extracting it.
By contrast, Zip and RAR and 7-zip (the most popular archivers on modern Windows platforms) usually compress each file separately, and compress metadata lightly if at all. This allows for cheap listing of the files in an archive and extraction of individual files, but means that redundancy between multiple files in the same archive cannot be exploited to increase compression. While in general compressing an already-compressed file does not reduce file size further, occasionally you might see a zip file within a zip file: the first zipping turned lots of small files into one big file (probably with compression disabled), which the second zipping then compressed as a single entity.
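You can feel this difference on any large archive (big.tar.gz and big.zip are placeholders):
$ time tar -tzf big.tar.gz > /dev/null   # listing has to decompress the whole stream
$ time unzip -l big.zip    > /dev/null   # listing only reads the central directory at the end of the file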
There is cross-pollination between the differing platforms and philosophies: gzip is essentially zip's compressor without its archiver, and xz is essentially 7-zip's compressor without its archiver.
There are other, specialized compressors. PPM variants and their successor ZPAQ are optimized for maximum compression without regard to resource consumption. They can easily chew up as much CPU and RAM as you can throw at them, and decompression is just as taxing as compression (for contrast, most widely-used compression tools are asymmetric: decompressing is cheaper than compressing).
On the other end of the spectrum, lzo, snappy and LZ4 are "light" compressors designed for maximum speed and minimum resource consumption, at the cost of compression ratio. They're widely used within filesystems and other object stores, but less so as standalone tools.
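They can still be paired with tar when you want a quick, lightly compressed archive; a sketch assuming GNU tar and the lz4 package (data/ is a placeholder):
$ tar -I lz4 -cf data.tar.lz4 data/   # -I is short for --use-compress-program
$ tar -I lz4 -xf data.tar.lz4         # run the stream back through the same program on extraction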
So which should you pick?
Archiving:
Since you're on Ubuntu there's no real reason to use anything other than tar for archiving, unless you're trying to make files that are easily readable elsewhere.
zip is hard to beat for ubiquity, but it's not Unix-centric and will not keep your filesystem permissions and ownership information, and its baked-in compression is antiquated. 7-zip and RAR (and ZPAQ) have more modern compression but are equally unsuited to archiving Unix filesystems (although there's nothing stopping you using them just as compressors); RAR is also proprietary.
Compression:
For maximum compression you can have a look at a benchmark, such as the enormous one at http://mattmahoney.net/dc/text.html. This should give you a better idea of the tradeoffs involved.
You probably don't want maximum compression, though. It's way too expensive.
xz is the most popular general-purpose compression tool on modern Unix systems. I believe 7-zip can read xz files too, as they are closely related.
Finally: if you're archiving data for anything other than short-term storage you should pick something open-source and preferably widespread, to minimize headaches later on.