Different md5sums for same tar contents
Solution 1:
As Dennis pointed out above, it's gzip. Part of the gzip header is a mod time for whatever is compressed in the file. If you need gzip, you can compress the tarfile as an extra step outside of tar rather than using tar's internal gzip. The gzip command has a flag to suppress the saving of that modification time.
tar -c ./bin |gzip -n >one.tgz
tar -c ./bin |gzip -n >two.tgz
md5sum one.tgz two.tgz
This will not affect times inside the tarfile, only the one in the gzip header.
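To see the header field in question, here is a minimal sketch that dumps the first ten bytes of a gzip stream with and without -n. It assumes gzip and a POSIX od are available; the scratch directory and the sample.txt file name are made up for the illustration. In the gzip format, bytes 4 to 7 (counting from 0) hold the modification time.

```shell
# Sketch: dump the 10-byte gzip header with and without -n.
# Bytes 4-7 (counting from 0) are the MTIME field, stored little-endian.
work=$(mktemp -d)                      # scratch directory (hypothetical)
printf 'hi\n' > "$work/sample.txt"     # sample input (hypothetical name)

# Default: gzip copies the input file's mtime into the header.
gzip -c "$work/sample.txt" | od -A d -t x1 -N 10

# With -n the MTIME field is written as zeros.
gzip -n -c "$work/sample.txt" | od -A d -t x1 -N 10

# Keep just the four MTIME bytes of each stream for a side-by-side look.
mtime_default=$(gzip -c "$work/sample.txt" | od -A n -t x1 -j 4 -N 4 | tr -d ' \n')
mtime_stripped=$(gzip -n -c "$work/sample.txt" | od -A n -t x1 -j 4 -N 4 | tr -d ' \n')
echo "default: $mtime_default  with -n: $mtime_stripped"

rm -rf "$work"
```

Everything after the header is the compressed tar stream, which -n leaves untouched.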
Solution 2:
To make a tar file with a consistent checksum, just prepend GZIP=-n like this:
GZIP=-n tar -zcf myOutputTarball.tar /home/luke/directoryIWantToZip
How this works: tar can pass options to gzip through the GZIP environment variable, set here for that one command only. As Valter said, tar uses gzip, which by default puts a timestamp in the archive, so you get a different checksum each time you compress the same files. The -n option disables that timestamp.
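Newer gzip releases document the GZIP variable as obsolescent and print a warning when it is set. A possible alternative, sketched here on the assumption that GNU tar 1.27 or later is installed, is to hand the compressor command to tar with -I (--use-compress-program). The directories and file below are hypothetical stand-ins, and whether the two archives come out identical also depends on tar's default archive format.

```shell
# Sketch, assuming GNU tar 1.27 or later (needed for arguments inside -I).
src=$(mktemp -d)                 # hypothetical stand-in for the directory to archive
out=$(mktemp -d)                 # hypothetical output directory
printf 'hi\n' > "$src/test.file"

# -I / --use-compress-program gives tar the compressor command, flags included.
tar -I 'gzip -n' -cf "$out/one.tgz" -C "$src" .
sleep 1                          # let the clock move so a stored timestamp would differ
tar -I 'gzip -n' -cf "$out/two.tgz" -C "$src" .

# Compare the two archives byte for byte.
if cmp -s "$out/one.tgz" "$out/two.tgz"; then same=yes; else same=no; fi
echo "identical: $same"

# The gzip MTIME field (bytes 4-7) should be all zeros when -n took effect.
hdr=$(od -A n -t x1 -j 4 -N 4 "$out/one.tgz" | tr -d ' \n')
echo "gzip header mtime bytes: $hdr"

rm -rf "$src" "$out"
```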
Solution 3:
I had this problem too. To stop gzip from storing the timestamp, use gzip -n:
-n, --no-name do not save or restore the original name and time stamp
[valter.silva@alog ~]$ gzip --help
Usage: gzip [OPTION]... [FILE]...
Compress or uncompress FILEs (by default, compress FILES in-place).
Mandatory arguments to long options are mandatory for short options too.
-c, --stdout write on standard output, keep original files unchanged
-d, --decompress decompress
-f, --force force overwrite of output file and compress links
-h, --help give this help
-l, --list list compressed file contents
-L, --license display software license
-n, --no-name do not save or restore the original name and time stamp
-N, --name save or restore the original name and time stamp
-q, --quiet suppress all warnings
-r, --recursive operate recursively on directories
-S, --suffix=SUF use suffix SUF on compressed files
-t, --test test compressed file integrity
-v, --verbose verbose mode
-V, --version display version number
-1, --fast compress faster
-9, --best compress better
--rsyncable Make rsync-friendly archive
With no FILE, or when FILE is -, read standard input.
Report bugs to <bug-gzip@gnu.org>.
Example:
[valter.silva@alog ~]$ ls
renewClaroMMSCanaisSemanal.log.gz s3
[valter.silva@alog ~]$ gunzip renew.log.gz
[valter.silva@alog ~]$ gunzip s3/renew.log.gz
[valter.silva@alog ~]$ md5sum renew.log
d41d8cd98f00b204e9800998ecf8427e renew.log
[valter.silva@alog ~]$ md5sum s3/renew.log
d41d8cd98f00b204e9800998ecf8427e s3/renew.log
[valter.silva@alog ~]$ gzip -n renew.log
[valter.silva@alog ~]$ gzip -n s3/renew.log
[valter.silva@alog ~]$ md5sum renew.log.gz
7029066c27ac6f5ef18d660d5741979a renew.log.gz
[valter.silva@alog ~]$ md5sum s3/renew.log.gz
7029066c27ac6f5ef18d660d5741979a s3/renew.log.gz
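The same effect can be reproduced from scratch. The sketch below compresses two byte-identical files whose modification times differ, first with default options and then with -n. The file contents and the forced date are made up for the illustration; the file names only mirror the example above.

```shell
# Sketch: same bytes, different mtimes.
work=$(mktemp -d)                              # scratch directory (hypothetical)
mkdir "$work/s3"
printf 'same content\n' > "$work/renew.log"
cp "$work/renew.log" "$work/s3/renew.log"
touch -t 200001010000 "$work/s3/renew.log"     # force a different mtime on the copy

# Without -n the stored mtime makes the two compressed streams differ.
plain_a=$(gzip -c "$work/renew.log" | md5sum)
plain_b=$(gzip -c "$work/s3/renew.log" | md5sum)

# With -n the name and mtime are left out, so the streams match.
bare_a=$(gzip -n -c "$work/renew.log" | md5sum)
bare_b=$(gzip -n -c "$work/s3/renew.log" | md5sum)

echo "default: $plain_a / $plain_b"
echo "with -n: $bare_a / $bare_b"

rm -rf "$work"
```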
Solution 4:
I went down a rabbit hole after the other answers failed me, and figured out that my version of tar (1.27.1 from the openSUSE 42.3 OSS repo) was using the non-deterministic pax archival format by default. This means that even without compression (and even when setting the mtime explicitly), archives created with tar from the same files would differ:
$ echo hi > test.file
$ tar --create --to-stdout test.file # long form of `tar cO test.file`
./PaxHeaders.13067/test.file0000644000000000000000000000013213427447703012603 xustar0030 mtime=1549684675.835011178
30 atime=1549684726.410510251
30 ctime=1549684675.835011178
test.file0000644000175000001440000000000313427447703013057 0ustar00hartusers00000000000000hi
$ tar --create --to-stdout test.file
./PaxHeaders.13096/test.file0000644000000000000000000000013213427447703012605 xustar0030 mtime=1549684675.835011178
30 atime=1549684726.410510251
30 ctime=1549684675.835011178
test.file0000644000175000001440000000000313427447703013057 0ustar00hartusers00000000000000hi
Note that the two outputs above differ even though no compression is being used: the PaxHeaders directory name changes between runs (13067 versus 13096). Since the uncompressed archives (generated by running tar twice on the same contents) are different, the compressed output will also differ, even when using GZIP=-n as other answers suggest.
In order to get around this, you can specify --format gnu:
$ tar --create --format gnu --to-stdout test.file
test.file0000644000175000001440000000000313427447703011557 0ustar hartusershi
$ tar --create --format gnu --to-stdout test.file
test.file0000644000175000001440000000000313427447703011557 0ustar hartusershi
This works with the suggestion about gzip above:
# gzip refuses to write compressed data to a terminal, so we'll use the `-f` option to create a file
$ GZIP=-n tar --format gnu -czf test.file.tgz test.file && md5sum test.file.tgz
0d8c7b3bdbe8066b516e3d3af60ade75 test.file.tgz
$ GZIP=-n tar --format gnu -czf test.file.tgz test.file && md5sum test.file.tgz
0d8c7b3bdbe8066b516e3d3af60ade75 test.file.tgz
# without GZIP=-n we see a different hash
$ tar --format gnu -czf test.file.tgz test.file && md5sum test.file.tgz
682ce0c8267b90f4103b4c29903c5a8d test.file.tgz
However, apart from the usual reasons to prefer a better compression format than gzip, you might want to consider using xz instead (which tar also supports with the --xz or -J flags in place of -z), because it saves you a step here: by default xz generates the same compressed output whenever the uncompressed contents are the same, so there's no need to specify an option like GZIP=-n:
$ tar --format gnu --xz -cf test.file.txz test.file && md5sum test.file.txz
dea99037d4b0ee4565b3639e93ac0930 test.file.txz
$ tar --format gnu --xz -cf test.file.txz test.file && md5sum test.file.txz
dea99037d4b0ee4565b3639e93ac0930 test.file.txz
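The archive format and the compressor header are not the only inputs to the checksum: file order, mtimes and ownership are recorded in the tar stream too. As a rough sketch that goes beyond what I tested above, and assuming GNU tar 1.28 or later (for --sort) plus gzip, these can be pinned as well. The pack helper, the directory layout and the epoch value 0 are all made up for the illustration.

```shell
# Sketch, assuming GNU tar 1.28+ (for --sort) and gzip; all paths are hypothetical.
work=$(mktemp -d)
mkdir "$work/a" "$work/b"
# The same two files in both trees, created in a different order.
printf 'hi\n'    > "$work/a/one"; printf 'there\n' > "$work/a/two"
printf 'there\n' > "$work/b/two"; printf 'hi\n'    > "$work/b/one"
touch -t 200001010000 "$work/a/one"            # and with a different mtime

pack() {                                       # hypothetical helper
    # Fix the format, entry order, timestamps and ownership, then drop
    # the gzip header timestamp with -n.
    tar --format=gnu --sort=name --mtime='@0' \
        --owner=0 --group=0 --numeric-owner \
        -C "$1" -cf - . | gzip -n
}

sum_a=$(pack "$work/a" | md5sum)
sum_b=$(pack "$work/b" | md5sum)
echo "a: $sum_a"
echo "b: $sum_b"

rm -rf "$work"
```

File modes are still taken from the files themselves, so the two trees need the same permissions for the checksums to agree.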