Verifying a large directory after copy from one hard drive to another

I’d simply use the diff command:

diff -rq --no-dereference /path/to/old/drive/ /path/to/new/drive/

This reads and compares every file in the directory trees and reports any differences. The -r flag compares the directories recursively, while the -q flag only prints a message when files differ, as opposed to printing the actual differences (as it does for text files). The --no-dereference flag makes diff compare symbolic links themselves rather than the files they point to, which is useful when, e.g., one directory holds a symbolic link and the corresponding directory holds a copy of the file that was linked to.

If the diff command prints no output, that means the directory trees are indeed identical; you can run echo $? to verify that its exit status is 0, indicating that both sets of files are the same.
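
As a rough sketch of the exit-status check described above, the comparison could be wrapped in a small function. The name verify_copy is made up for illustration; the meaning of the statuses (0 for no differences, 1 for differences, anything higher for trouble such as an unreadable file) follows GNU diff.

```shell
# Sketch only: verify_copy is a hypothetical helper name, not part of diff.
# It runs the same diff command as above and spells out the exit status.
verify_copy() {
    diff -rq --no-dereference "$1" "$2"
    rc=$?
    case $rc in
        0) echo "identical" ;;
        1) echo "differences found" ;;
        *) echo "diff reported trouble (exit status $rc)" ;;
    esac
    return "$rc"
}

# Usage (placeholder paths):
# verify_copy /path/to/old/drive/ /path/to/new/drive/
```

Because the function returns diff's own status, it can also be used in a script, e.g. verify_copy old new && echo "safe to wipe the old drive".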

I don’t think computing CRCs or checksums is particularly beneficial in this case. It would make more sense if the two sets of files were on different systems and each system could compute the checksums for their own set of files so only the checksums need to be sent over the network. Another common reason for computing checksums is to keep a copy of the checksums for future use.
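
If you do want to keep checksums for future use, a minimal sketch might look like the following. It assumes GNU coreutils and findutils (sha256sum, sort -z, xargs -0 -r); make_manifest and check_manifest are hypothetical helper names, and the manifest file should live outside the directory being hashed.

```shell
# Sketch only: helper names are made up for illustration.

# Write SHA-256 checksums of every regular file under directory $1 to the
# manifest file $2, using paths relative to $1 so the manifest can later be
# checked against a copy stored somewhere else.
make_manifest() {
    (cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum) > "$2"
}

# Check directory $1 against manifest $2. With --quiet only failures are
# printed; the exit status is non-zero if any file fails to verify.
check_manifest() {
    (cd "$1" && sha256sum --quiet -c -) < "$2"
}

# Usage (placeholder paths):
# make_manifest /path/to/old/drive /tmp/drive.sha256
# check_manifest /path/to/new/drive /tmp/drive.sha256
```

Note that this reads every file once per tree, so for two local drives it does the same amount of I/O as diff; the gain is only that the manifest can be reused later.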

rsync is often used to copy files instead of gcp, but it can also be used to verify a copy, however it was made. Simply do

rsync -niaHc /origfolder/ /copyfolder

Be careful to end the first folder name (the source) with a /. The options are

-n do not copy (make no changes)
-i itemise the differences
-a preserve (i.e. compare since we have -n) permissions, ownerships, symbolic links, etc. and recurse down directories
-H preserve hard links
-c compare checksums

The output shows a code detailing the differences for each file or directory that differs. There is no output if they are the same. The code has columns YXcstpoguax where each character is a dot . if that aspect of the comparison is ok, or a letter:

Y is type of update: 
   < sent (not appropriate in this case)
   > need to copy 
   c missing file or directory
   h is hard link
   . no update
   * and rest of line is a message, eg *deleting
X file type: f file  d dir  L symlink  D device S special file
c checksum differs. + new item  " " same
s size differs
t timestamp differs
p permissions differ
o owner differs
g group differs
u (not used)
a ACL differs
x extended attributes differ

For example,

.d..t...... a/b/                    directory timestamp differs
cL+++++++++ a/b/d -> /nosuch2       symbolic link missing
cS+++++++++ a/b/f                   special file missing (a/b/f is a fifo)
>f..t...... a/b/ff                  file timestamp differs
hf          a/b/xx1 => a/b/xx       files should be hard linked
cLc.t...... a/b/z -> /tmp/hi2       symbolic link to different name
cd+++++++++ a/c/                    directory missing
>f+++++++++ a/c/i.10                missing file needs to be copied

See man rsync under --itemize-changes for more details. If you have differences in the 3rd (c) or 4th (s) column, the file contents differ and you have serious data corruption. Other flags, such as different permissions, owner or timestamps, may be less important to you. If all the files are marked as "missing", then you have probably not given the right directories to compare. If you are sure, running rsync without the -n flag will "fix" the differences by updating the copy from the original.
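
For a large tree the itemised output can be long, so a rough way to pull out only the serious cases is to filter it. The sketch below is an illustration, not an rsync feature: summarize_itemized is a made-up name, and it assumes the 11-character YXcstpoguax code at the start of each line as described above.

```shell
# Sketch only: summarize_itemized is a hypothetical helper name.
# It reads rsync --itemize-changes lines on stdin and counts
#   - regular files whose checksum (3rd column) or size (4th column) differs
#   - items missing from the copy (columns filled with + signs)
summarize_itemized() {
    awk '
        {
            code = substr($0, 1, 11)          # the YXcstpoguax field
            type = substr(code, 2, 1)         # f, d, L, D or S
            if (substr(code, 3, 1) == "+")
                missing++
            else if (type == "f" && (substr(code, 3, 1) == "c" || substr(code, 4, 1) == "s"))
                content++
        }
        END { printf "content differs: %d, missing: %d\n", content, missing }
    '
}

# Usage (folder names as in the rsync command above):
# rsync -niaHc /origfolder/ /copyfolder | summarize_itemized
```

Lines that only report timestamp, permission or ownership changes are ignored by this filter, so review the full rsync output if those matter to you.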

I had the same question and I used Anthony's answer, with a bit of a twist.

Applying his answer directly will fail if there is a hardware failure (such as an input/output error), which forces diff to exit.

I combined his answer with another answer and put it all together into this:

find /path/to/original -type f -exec bash -c 'diff -rq --no-dereference "$@" "/path/to/destination/$(sed -r "s/^.*(<first-common-ancestor>.*)$/\1/g" <<<"$@")"' bash {} \;

Replace /path/to/original with the path of the original directory you copied.
Replace /path/to/destination with the path of the destination directory you copied to.
Replace <first-common-ancestor> with the common ancestor directory between both. Example: you are copying from /media/foo/bar to /media/test/dst/, so that dst, after the copy operation is done, contains the directory bar. The first common ancestor is bar here, because all files under bar will have the same relative path in both trees.

Some notes:

The bash -c and bash {} parts pass each file name to the script as a positional argument instead of pasting it into the command text, so that unusual or malicious file names cannot be interpreted as shell code (and lead to attacks such as privilege elevation).
The sed part removes the absolute path of the file found and keeps only the relative path (this is different from using execdir). If you are not sure how this is useful, try removing it and check the error messages :)
The <<< (a here-string) feeds the variable's value to sed as input text, rather than having sed treat it as the name of a file to read.
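
For comparison, here is a hedged variant of the same per-file idea. It is not the command above: it swaps diff for cmp -s and strips the source prefix with shell parameter expansion instead of sed, and compare_trees is a made-up name. It assumes the destination has the same layout as the source below the two directories you pass in.

```shell
# Sketch only: compare_trees is a hypothetical helper name.
# $1 is the original directory, $2 the copy. Each regular file under $1 is
# compared byte by byte with the file at the same relative path under $2.
# A failed comparison (different, missing or unreadable) is reported and the
# loop carries on with the next file.
compare_trees() {
    find "${1%/}" -type f -exec sh -c '
        src=$1 dst=$2
        shift 2
        for file do
            rel=${file#"$src"/}    # relative path below the source directory
            cmp -s -- "$file" "$dst/$rel" ||
                printf "DIFFERS OR UNREADABLE: %s\n" "$rel"
        done
    ' sh "${1%/}" "${2%/}" {} +
}

# Usage (placeholder paths; no output means every file matched):
# compare_trees /path/to/original /path/to/destination
```

As in the command above, the file names reach the inner shell as positional arguments, so names with spaces or other special characters are handled safely.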
