Using pv with md5sum
The pv utility is a "fancy cat", which means that you may use pv in most situations where you would use cat.
Using cat with md5sum, you can compute the MD5 checksum of a single file with
cat file | md5sum
or, with pv,
pv file | md5sum
Unfortunately, neither form lets md5sum insert the filename into its output properly: it only sees standard input, so it prints - in place of the name.
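If you want to keep the pipe form and still get a normal-looking line, one option is to put the name back yourself. The following is a minimal sketch rather than anything pv or md5sum provides: md5_with_name is a hypothetical helper name, and the fallback to cat when pv is not installed is my own assumption for convenience.

```shell
#!/bin/sh
# Hypothetical helper: hash one file through pv and print the result
# in the usual "hash  name" layout that md5sum uses for named files.
md5_with_name() {
    file=$1
    # Refuse unreadable files early; otherwise md5sum would happily
    # hash the empty input left over from a failed reader.
    [ -r "$file" ] || return 1
    reader=pv
    # Assumption: fall back to plain cat if pv is not installed.
    command -v pv >/dev/null 2>&1 || reader=cat
    # pv draws its progress bar on stderr, so the command substitution
    # captures only md5sum's "hash  -" line.
    sum=$("$reader" "$file" | md5sum) || return 1
    # Keep everything before the first space (the hash), then add the name.
    printf '%s  %s\n' "${sum%% *}" "$file"
}
```

Calling md5_with_name file shows the pv progress bar for that file and then prints a single checksum line. Names containing backslashes or newlines, which GNU md5sum would escape, are printed as-is by this sketch.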
Now, fortunately, pv is a really fancy cat, and on some systems (Linux) it is able to watch the data being passed through another process. This is done by using its -d option with the process ID of that other process.
This means that you can do things like
md5sum dir/* | sort >sums &
sleep 1
pv -d "$(pgrep -n md5sum)"
This allows pv to watch the md5sum process. The sleep is there to give md5sum, which is running in the background, time to start properly. pgrep -n md5sum returns the PID of the most recently started md5sum process that you own. pv will exit as soon as the process that it is watching terminates.
I've tested this particular way of running pv a few times and it generally seems to work well, but sometimes it stops outputting anything as md5sum switches to the next file. Sometimes it also seems to spawn spurious background tasks in the shell.
It would probably be safest to run it as
md5sum dir/* >sums &
sleep 1
pv -W -d "$!"
sort -o sums sums
The -W option causes pv to wait until data is actually being transferred, although this also does not always seem to work reliably.
The data that you are feeding through the pipe is not the data of the files that md5sum is processing, but the md5sum output, which, for every file, consists of one line comprising the MD5 hash, two spaces, and the file name. Since we know this in advance, we can inform pv accordingly, so that it can display an accurate progress indicator. There are two ways of doing so.
The first, preferred method (suggested by frostschutz) makes use of the fact that md5sum generates one line per processed file, and the fact that pv has a line mode that counts lines rather than bytes. In this mode pv will only move the progress bar when it encounters a newline in the throughput, i.e. once per file finished by md5sum. In Bash, this first method can look like this:
set -- *.iso; md5sum "$@" | pv --line-mode -s $# | sort
The set builtin is used to set the positional parameters to the files to be processed (the *.iso shell pattern is expanded by the shell). md5sum is then told to process these files ($@ expands to the positional parameters), and pv in line mode will move the progress indicator each time a file has been processed, i.e. each time md5sum outputs a line. Notably, pv is informed of the total number of lines it can expect (-s $#), as the special shell parameter $# expands to the number of positional parameters.
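If you use this often, the pipeline can be wrapped in a small function so that the file list does not have to go through set first. This is only a sketch: md5_lines_progress is a hypothetical name, and running without a progress bar when pv is missing is an assumption on my part, not something the method requires.

```shell
#!/bin/sh
# Hypothetical wrapper around the line-mode pipeline.
md5_lines_progress() {
    # Nothing to hash: avoid md5sum waiting on standard input.
    [ "$#" -gt 0 ] || return 0
    if command -v pv >/dev/null 2>&1; then
        # One output line per file, so the file count is the expected size.
        md5sum -- "$@" | pv --line-mode -s "$#" | sort
    else
        # Assumption: without pv, just run the pipeline with no progress bar.
        md5sum -- "$@" | sort
    fi
}
```

It would be called as md5_lines_progress *.iso. Note that the count is only right if every argument is a readable regular file; a directory or unreadable file produces an error instead of a line, and the bar then stops short of 100%.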
The second method is not line-based but byte-based. With md5sum this is unnecessarily complicated, but some other program may not produce lines but, for instance, continuous data, and then this approach may be more practical. I illustrate it with md5sum, though. The idea is to calculate the amount of data that md5sum (or some other program) will produce, and use this to inform pv. In Bash, this could look as follows:
os=$(( $( ls -1 | wc -c ) + $( ls -1 | wc -l ) * 34 ))
md5sum * | pv -s $os | sort
The first line calculates an estimate of the output size (os): the first term is the number of bytes needed for the filenames (including one newline each), and the second term is the number of bytes used for the MD5 hashes (32 bytes each) plus the 2 spaces. In the second line, we tell pv that the expected amount of data is os bytes, so that it can show an accurate progress indicator leading up to 100% (the indicator is updated once per file that md5sum finishes).
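The same arithmetic can be done without parsing ls, by summing over the file names directly. The sketch below assumes the output layout described above (32 hex digits, 2 separator characters, the name, 1 newline); expected_md5_bytes is a hypothetical helper name, and it does not account for the extra escaping GNU md5sum applies to names containing backslashes or newlines.

```shell
#!/bin/sh
# Hypothetical helper: predict how many bytes md5sum will print for the
# given file names, without parsing the output of ls.
expected_md5_bytes() {
    total=0
    for name in "$@"; do
        # wc -c counts bytes, so multi-byte characters in names are handled.
        len=$(printf '%s' "$name" | wc -c)
        # 32 hex digits + 2 separator characters + 1 newline = 35 per line.
        total=$(( total + len + 35 ))
    done
    printf '%s\n' "$total"
}

# Possible use (requires pv):
#   md5sum * | pv -s "$(expected_md5_bytes *)" | sort
```

For two files named a and bb this gives (1 + 35) + (2 + 35) = 73 bytes, which matches the length of the two lines md5sum prints for them.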
Obviously, both methods are only practical when multiple files are to be processed. Also, note that since the size of the md5sum output is unrelated to the amount of time md5sum has to spend crunching the underlying data, the progress indicator may be considered somewhat misleading. E.g., in the second method, the file with the shortest name will yield the smallest progress update, even though it may actually be the biggest in size. Then again, if all files have similar sizes and names, this shouldn't matter much.
Here's a dirty hack to get progress per file:
for f in iso/*
do
pv "$f" | (
cat > /dev/null &
md5sum "$f"
wait
)
done
What it looks like:
4.15GiB 0:00:32 [ 130MiB/s] [================================>] 100%
0db0b36fc7bad7b50835f68c369e854c iso/KNOPPIX_V7.6.1DVD-2016-01-16-EN.iso
792MiB 0:00:06 [ 130MiB/s] [================================>] 100%
97537db63e61d20a5cb71d29145b2937 iso/archlinux-2016.10.01-dual.iso
843MiB 0:00:06 [ 129MiB/s] [================================>] 100%
1b5dc31e038499b8409f7d4d720e3eba iso/lubuntu-16.04-desktop-i386.iso
259MiB 0:00:02 [ 130MiB/s] [=========> ] 30% ETA 0:00:04
...
Now, this makes several assumptions. Firstly, that reading data is slower than hashing it. Secondly, that the OS will cache the I/O, so the data won't be (physically) read twice even though pv and md5sum are completely independent readers.
The nice thing about such a dirty, dirty hack is that you can easily adapt it to make a progress bar across all the data, not just one file. And still do weird stuff like sort the output afterwards.
pv iso/* | (
cat > /dev/null &
md5sum iso/* | sort
wait
)
What it looks like (ongoing):
15.0GiB 0:01:47 [ 131MiB/s] [===========================> ] 83% ETA 0:00:21
What it looks like (finished):
18.0GiB 0:02:11 [ 140MiB/s] [================================>] 100%
0db0b36fc7bad7b50835f68c369e854c iso/KNOPPIX_V7.6.1DVD-2016-01-16-EN.iso
155603390e65f2a8341328be3cb63875 iso/systemrescuecd-x86-4.2.0.iso
1b5dc31e038499b8409f7d4d720e3eba iso/lubuntu-16.04-desktop-i386.iso
1b6ed6ff8d399f53adadfafb20fb0d71 iso/systemrescuecd-x86-4.4.1.iso
25715326d7096c50f7ea126ac20eabfd iso/openSUSE-13.2-KDE-Live-i686.iso
...
Now, that's for the hacks. Check other answers for proper solutions. ;-)