How can I check if two gzipped files are equal?
@derobert's answer is great, though I want to share some other information that I have found.
gzip -l -v
gzip-compressed files already contain a checksum (a CRC-32, so not a secure hash; see this SO post):
$ echo something > foo
$ gzip foo
$ gzip -v -l foo.gz
method  crc       date  time           compressed        uncompressed  ratio uncompressed_name
defla   18b1f736  Feb  8 22:34                 34                  10 -20.0% foo
One can combine the CRC and uncompressed size to get a quick fingerprint:
gzip -v -l foo.gz | awk '{print $2, $7}'
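For example, a quick (though not tamper-proof) equality check can compare those two fields for a pair of archives; file1.gz and file2.gz are just placeholder names here:
$ a=$(gzip -l -v file1.gz | awk 'NR==2 {print $2, $7}')
$ b=$(gzip -l -v file2.gz | awk 'NR==2 {print $2, $7}')
$ [ "$a" = "$b" ] && echo "same CRC and size" || echo "different"
The NR==2 just drops the column headers from the output.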
cmp
For checking whether two files are byte-for-byte equal, use cmp file1 file2. Now, a gzipped file consists of a header, the compressed data, and a footer (CRC plus original size). The description of the gzip format shows that the header contains the time when the file was compressed and that the file name is stored as a nul-terminated string right after the 10-byte fixed header.
So, assuming that the file name is constant and the same command (gzip "$name") is used, one can check whether two files differ by using cmp and skipping the first bytes, including the timestamp:
cmp -i 8 file1 file2
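To see exactly what is being skipped, you can dump the 10-byte fixed header of the foo.gz example from above; bytes 0-1 are the magic number, byte 2 the compression method, byte 3 the flags, bytes 4-7 the modification time, byte 8 the extra flags, and byte 9 the operating system:
$ od -A d -t x1 -N 10 foo.gz
So skipping the first 8 bytes is enough to jump past the timestamp.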
Note: the assumption that the same compression options are used is important; otherwise, the command will always report the files as different. This happens because the compression options are stored in the header and may affect the compressed data. cmp just looks at raw bytes and does not interpret them as gzip data.
If the stored filenames have the same length, you can calculate a fixed number of bytes to skip that covers the header and the filename. When the filenames have different lengths, you have to skip a different number of bytes in each file, for example cmp <(cut -b9- file1) <(cut -b10- file2); see the sketch below.
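Here is a minimal sketch of that idea, assuming both archives were created with gzip name (so the header contains a stored file name) and that you know those names; it skips the 10-byte fixed header plus the nul-terminated name of each file, then compares the rest:
$ name1=backup-file; name2=backup-file-old     # hypothetical stored names
$ skip1=$((10 + ${#name1} + 1))                # fixed header + name + NUL
$ skip2=$((10 + ${#name2} + 1))
$ cmp <(tail -c +$((skip1 + 1)) file1.gz) <(tail -c +$((skip2 + 1)) file2.gz) \
      && echo "same compressed payload" || echo "different"
The process substitution needs Bash (or a similar shell); tail -c +N starts output at byte N, which is why 1 is added to the number of bytes to skip.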
zcmp
This is definitely the best way to go: it first decompresses the data and then compares the bytes with cmp (really, this is what the zcmp (zdiff) shell script does).
One note: do not be afraid of the following remark in the manual page:
When both files must be uncompressed before comparison, the second is uncompressed to /tmp. In all other cases, zdiff and zcmp use only a pipe.
When you have a sufficiently new Bash, decompression will not use a temporary file, just a pipe. Or, as the zdiff source says:
# Reject Solaris 8's buggy /bin/bash 2.03.
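A typical invocation only needs the exit status; file1.gz and file2.gz are again just placeholder names:
$ zcmp file1.gz file2.gz && echo "same uncompressed content" || echo "different"
The exit status is 0 when the uncompressed contents match and 1 when they differ.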
You can use zcmp or zdiff as mreithub suggests in his comment (or Kevin's command, which is similar). These will be relatively inefficient, as they actually decompress both files and then pass them off to cmp or diff. If you just want to answer "are they the same", you want cmp; it'll be much faster.
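The quick yes/no check then looks something like this (old.gz and new.gz are placeholder names); it is essentially what zcmp does internally, written out with Bash process substitution:
$ cmp -s <(gzip -cd old.gz) <(gzip -cd new.gz) && echo same || echo different
cmp -s suppresses all output and stops at the first differing byte, so in the common case it does much less work than a full diff.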
Your approach with md5sum is perfectly good, but you need to take the MD5 before running gzip. Then store it in a file alongside the resulting .gz file. You can then compare the new file against that checksum easily, before compressing it. If the name is the same, md5sum -c will do this for you.
$ mkdir "backup1"
$ cd backup1
$ echo "test" > backup-file
$ md5sum backup-file > backup-file.md5
$ gzip -9 backup-file
And the next backup:
$ mkdir "backup2"
$ cd backup2
$ echo "test" > backup-file
$ md5sum -c ../backup1/backup-file.md5
backup-file: OK
So it hasn't changed. OTOH, had it changed:
$ echo "different" > backup-file
$ md5sum -c ../backup1/backup-file.md5
backup-file: FAILED
md5sum: WARNING: 1 computed checksum did NOT match
If you pass --quiet to it, it suppresses the OK lines (use --status to suppress all output); either way, the exit code is 0 if everything matched and non-zero if something differed.
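So a backup script could do something like this (continuing the backup2 example above); the echo messages are just placeholders for whatever action you want to take:
$ md5sum -c --quiet ../backup1/backup-file.md5 && echo "unchanged, skip it" || echo "changed, back it up"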
MD5 is fairly quick, but not blazingly so. MD4 (openssl md4 is the best you get on the command line, I believe) is around twice as fast (neither it nor MD5 is secure, but both are about as collision-resistant when no one is attempting to subvert them). SHA-1 (sha1sum) is more secure, but slower; SHA-256 (sha256sum) is more secure still, but even slower. CRC32 should be many times faster, but is shorter and thus will have more random collisions. It's also entirely insecure.
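If you want to see the speed difference on your own data, timing the various tools on a reasonably large file gives a quick feel for it; big-backup.tar is just an example name, and on OpenSSL builds that moved MD4 into the legacy provider the md4 digest may be unavailable:
$ time md5sum big-backup.tar
$ time openssl md4 big-backup.tar
$ time sha1sum big-backup.tar
$ time sha256sum big-backup.tar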