How to calculate a hash of a file that is 1 terabyte or larger?
If you have a 1 TB (1 million MB) file, and your system can read that file at 100 MB/s, then:
- 1 TB * 1000 (GB/TB) = 1000 GB
- 1000 GB * 1000 (MB/GB) = 1,000,000 MB
- 1,000,000 MB / 100 (MB/s) = 10,000 seconds
- 10,000 s / 3600 (s/hr) = 2.77... hr
- Therefore, a 100 MB/s system has a hard floor of about 2.8 hours just to read the file, before any additional time needed to actually compute a hash.
Your expectations are probably unrealistic - don't try to calculate a faster hash until you can perform a faster file read.
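If you want to plug in your own numbers, here is a minimal Perl sketch of the same back-of-the-envelope arithmetic; the 1 TB size and 100 MB/s throughput are just the example figures from above, not measurements.

#!/usr/bin/perl
# Sketch of the read-time floor: file size divided by sequential read speed.
use strict;
use warnings;

my $file_size_mb  = 1_000_000;  # 1 TB expressed in MB (example value)
my $read_mb_per_s = 100;        # assumed sequential read throughput

my $seconds = $file_size_mb / $read_mb_per_s;
printf "Read floor: %u seconds (%.2f hours)\n", $seconds, $seconds / 3600;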
This is old and already answered, but you could try hashing only specific chunks of the file.
There is a Perl solution I found somewhere that seems effective (the code is not mine):
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw[ time ];
use Digest::MD5;

# Hash the file size plus the first 4 KB of every 4 MB block,
# instead of reading the whole file.
sub quickMD5 {
    my $fh  = shift;
    my $md5 = Digest::MD5->new;

    # Mix the file size into the digest so files of different
    # lengths are less likely to collide.
    $md5->add( -s $fh );

    my $pos = 0;
    until ( eof $fh ) {
        seek $fh, $pos, 0;                     # jump to the next sample point
        read( $fh, my $block, 4096 ) or last;  # read 4 KB
        $md5->add( $block );
        $pos += 2048**2;                       # advance 4 MB (2048^2 bytes)
    }
    return $md5;
}

open FH, '<', $ARGV[0] or die $!;
binmode FH;    # read raw bytes (matters on Windows)
printf "Processing $ARGV[0] : %u bytes\n", -s FH;

my $start = time;
my $qmd5  = quickMD5( *FH );

printf "Partial MD5 took %.6f seconds\n", time() - $start;
print "Partial MD5: ", $qmd5->hexdigest, "\n";
Basically, the script feeds the first 4 KB of every 4 MB block of the file into MD5 (the original version I found sampled every 1 MB).
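For reference, a hypothetical invocation (the script name quickmd5.pl and the file path are just placeholders):

perl quickmd5.pl /path/to/huge-file.img

Keep in mind the result is only a partial fingerprint: bytes outside the sampled 4 KB windows never reach the digest, so this is a quick identity check rather than a full integrity hash.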