How can I check if two gzipped files are equal?
@derobert's answer is great, though I want to share some other information that I have found.
gzip -l -v
gzip-compressed files already contain a checksum (a CRC-32, so not a secure hash; see this SO post):
$ echo something > foo
$ gzip foo
$ gzip -v -l foo.gz
method  crc       date  time           compressed        uncompressed  ratio uncompressed_name
defla   18b1f736  Feb  8 22:34                 34                  10 -20.0% foo
One can combine the CRC and uncompressed size to get a quick fingerprint:
gzip -v -l foo.gz | awk '{print $2, $7}'
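For example, a quick (though not tamper-proof) equality check can compare those two fields for a pair of archives; file1.gz and file2.gz are just placeholder names here:
$ a=$(gzip -l -v file1.gz | awk 'NR==2 {print $2, $7}')
$ b=$(gzip -l -v file2.gz | awk 'NR==2 {print $2, $7}')
$ [ "$a" = "$b" ] && echo "same CRC and size" || echo "different"
The NR==2 just drops the column headers from the output.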
cmp
For checking whether two files are byte-for-byte equal, use cmp file1 file2. Now, a gzipped file consists of a header, the compressed data, and a footer (CRC plus original size). The description of the gzip format shows that the header contains the time when the file was compressed and that the file name is stored as a nul-terminated string right after the 10-byte fixed header.
So, assuming that the file name is constant and the same command (gzip "$name") is used, one can check whether two files differ by using cmp and skipping the first bytes, including the timestamp:
cmp -i 8 file1 file2
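To see exactly what is being skipped, you can dump the 10-byte fixed header of the foo.gz example from above; bytes 0-1 are the magic number, byte 2 the compression method, byte 3 the flags, bytes 4-7 the modification time, byte 8 the extra flags, and byte 9 the operating system:
$ od -A d -t x1 -N 10 foo.gz
So skipping the first 8 bytes is enough to jump past the timestamp.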
Note: the assumption that the same compression options are used is important; otherwise, the command will always report the files as different. This happens because the compression options are stored in the header and may affect the compressed data. cmp just looks at raw bytes and does not interpret them as gzip data.
If the stored filenames have the same length, you can calculate a fixed number of bytes to skip that covers the header and the filename. When the filenames have different lengths, you have to skip a different number of bytes in each file, for example cmp <(cut -b9- file1) <(cut -b10- file2); see the sketch below.
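Here is a minimal sketch of that idea, assuming both archives were created with gzip name (so the header contains a stored file name) and that you know those names; it skips the 10-byte fixed header plus the nul-terminated name of each file, then compares the rest:
$ name1=backup-file; name2=backup-file-old     # hypothetical stored names
$ skip1=$((10 + ${#name1} + 1))                # fixed header + name + NUL
$ skip2=$((10 + ${#name2} + 1))
$ cmp <(tail -c +$((skip1 + 1)) file1.gz) <(tail -c +$((skip2 + 1)) file2.gz) \
      && echo "same compressed payload" || echo "different"
The process substitution needs Bash (or a similar shell); tail -c +N starts output at byte N, which is why 1 is added to the number of bytes to skip.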
zcmp
This is definitely the best way to go: it first decompresses the data and then compares the bytes with cmp (really, this is what the zcmp (zdiff) shell script does).
One note: do not be afraid of the following remark in the manual page:
When both files must be uncompressed before comparison, the second is uncompressed to /tmp. In all other cases, zdiff and zcmp use only a pipe.
When you have a sufficiently new Bash, decompression will not use a temporary file, just a pipe. Or, as the zdiff source says:
# Reject Solaris 8's buggy /bin/bash 2.03.
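A typical invocation only needs the exit status; file1.gz and file2.gz are again just placeholder names:
$ zcmp file1.gz file2.gz && echo "same uncompressed content" || echo "different"
The exit status is 0 when the uncompressed contents match and 1 when they differ.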
You can use zcmp or zdiff as mreithub suggests in his comment (or Kevin's command, which is similar). These will be relatively inefficient, as they actually decompress both files and then pass them off to cmp or diff. If you just want to answer "are they the same", you want cmp; it'll be much faster.
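The quick yes/no check then looks something like this (old.gz and new.gz are placeholder names); it is essentially what zcmp does internally, written out with Bash process substitution:
$ cmp -s <(gzip -cd old.gz) <(gzip -cd new.gz) && echo same || echo different
cmp -s suppresses all output and stops at the first differing byte, so in the common case it does much less work than a full diff.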
Your approach with md5sum is perfectly good, but you need to take the MD5 before running gzip. Then store it in a file alongside the resulting .gz file. You can then compare the new file against that checksum easily, before compressing it. If the name is the same, md5sum -c will do this for you.
$ mkdir "backup1"
$ cd backup1
$ echo "test" > backup-file
$ md5sum backup-file > backup-file.md5
$ gzip -9 backup-file
And the next backup:
$ mkdir "backup2"
$ cd backup2
$ echo "test" > backup-file
$ md5sum -c ../backup1/backup-file.md5
backup-file: OK
So it hasn't changed. OTOH, had it changed:
$ echo "different" > backup-file
$ md5sum -c ../backup1/backup-file.md5
backup-file: FAILED
md5sum: WARNING: 1 computed checksum did NOT match
If you pass --quiet to it, it suppresses the OK lines (use --status to suppress all output); either way, the exit code is 0 if everything matched and non-zero if something differed.
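So a backup script could do something like this (continuing the backup2 example above); the echo messages are just placeholders for whatever action you want to take:
$ md5sum -c --quiet ../backup1/backup-file.md5 && echo "unchanged, skip it" || echo "changed, back it up"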
MD5 is fairly quick, but not blazingly so. MD4 (openssl md4 is the best you get on the command line, I believe) is around twice as fast (neither it nor MD5 is secure, but both are about as collision-resistant when no one is attempting to subvert them). SHA-1 (sha1sum) is more secure, but slower; SHA-256 (sha256sum) is more secure still, but even slower. CRC32 should be many times faster, but is shorter and thus will have more random collisions. It's also entirely insecure.
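If you want to see the speed difference on your own data, timing the various tools on a reasonably large file gives a quick feel for it; big-backup.tar is just an example name, and on OpenSSL builds that moved MD4 into the legacy provider the md4 digest may be unavailable:
$ time md5sum big-backup.tar
$ time openssl md4 big-backup.tar
$ time sha1sum big-backup.tar
$ time sha256sum big-backup.tar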