Verifying a large directory after copy from one hard drive to another
I’d simply use the diff
command:
diff -rq --no-dereference /path/to/old/drive/ /path/to/new/drive/
This reads and compares every file in the directory trees and reports any differences. The -r
flag compares the directories recursively while the -q
flag just prints a message to screen when files differ – as opposed to printing the actual differences (as it does for text files). The --no-dereference
flag makes diff compare symbolic links themselves rather than the files they point to. This is useful when the trees differ in their links, e.g., one directory contains a symbolic link and the corresponding directory contains a copy of the file that was linked to.
If the diff
command prints no output, that means the directory trees are indeed identical; you can run echo $?
to verify that its exit status is 0
, indicating that both sets of files are the same.
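As a rough sketch of how the exit status could be checked in a script: the function name verify_copy is made up for illustration, and GNU diff is assumed (it provides --no-dereference). diff exits with 0 when nothing differs, 1 when differences were found, and 2 when it ran into trouble such as an unreadable file.

```shell
# verify_copy OLD NEW  (hypothetical helper name, assumes GNU diff)
# Runs the recursive comparison and turns diff's exit status into a message.
verify_copy() {
    diff -rq --no-dereference "$1" "$2"
    status=$?
    case $status in
        0) echo "identical" ;;
        1) echo "differences found" ;;
        *) echo "diff reported trouble (exit status $status)" ;;
    esac
    return "$status"
}
```

It could then be called as verify_copy /path/to/old/drive/ /path/to/new/drive/ and used directly in an if statement.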
I don’t think computing CRCs or checksums is particularly beneficial in this case. It would make more sense if the two sets of files were on different systems: each system could compute the checksums for its own set of files, so only the checksums would need to be sent over the network. Another common reason for computing checksums is to keep a copy of them for future use.
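If you do want to keep checksums for future use, one possible approach (not part of the diff method above) is a checksum manifest. This is a minimal sketch assuming GNU coreutils (sha256sum) and GNU find, sort and xargs; the function names make_manifest and check_manifest are hypothetical.

```shell
# make_manifest DIR MANIFEST  (hypothetical helper, assumes GNU tools)
# Writes one "checksum  ./relative/path" line per regular file under DIR.
make_manifest() {
    ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$2"
}

# check_manifest DIR MANIFEST
# Re-reads every file listed in MANIFEST relative to DIR; prints only
# failures and returns non-zero if any file is missing or differs.
check_manifest() {
    ( cd "$1" && sha256sum --check --quiet - ) < "$2"
}
```

For example, make_manifest /path/to/old/drive/ ~/drive.sha256 followed later by check_manifest /path/to/new/drive/ ~/drive.sha256. Because the manifest stores relative paths, the same file can be used to check either tree.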
rsync is often used to copy files instead of gcp
, but it can also be used to verify a copy, however it was made. Simply do
rsync -niaHc /origfolder/ /copyfolder
Be careful to end the first folder name (the source) with a /
.
The options are
-n  do not copy (make no changes)
-i  itemise the differences
-a  preserve (i.e. compare, since we have -n) permissions, ownerships, symbolic links, etc. and recurse down directories
-H  preserve hard links
-c  compare checksums
The output shows a code detailing the differences for each file or directory that differs. There is no output if they are the same. The code has columns YXcstpoguax
where each character is a dot .
if that aspect of the comparison is ok, or a letter:
Y is the type of update:
  <  sent (not appropriate in this case)
  >  needs to be copied
  c  missing file or directory
  h  is a hard link
  .  no update
  *  the rest of the line is a message, e.g. *deleting
X is the file type: f file, d dir, L symlink, D device, S special file
c checksum differs (in this and the following columns, + means a new item and " " means same)
s size differs
t timestamp differs
p permissions differ
o owner differs
g group differs
u (not used)
a ACL differs
x extended attributes differ
For example,
.d..t...... a/b/                  directory timestamp differs
cL+++++++++ a/b/d -> /nosuch2     symbolic link missing
cS+++++++++ a/b/f                 special file missing (a/b/f is a fifo)
>f..t...... a/b/ff                file timestamp differs
hf          a/b/xx1 => a/b/xx     files should be hard linked
cLc.t...... a/b/z -> /tmp/hi2     symbolic link to different name
cd+++++++++ a/c/                  directory missing
>f+++++++++ a/c/i.10              missing file needs to be copied
See man rsync
under --itemize-changes
for more details. If you have differences in the 3rd c
or 4th s
columns, then you have serious data corruption. Other flags such as different permissions, owner or timestamps may be less important to you. If all the files are marked as "missing" then you have probably not given the right directories to compare. If you are sure, running rsync without the -n
flag will "fix" the differences.
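To pick out only the lines that matter most from a long listing, the itemised output can be filtered on the 3rd and 4th characters. This is only a sketch: the function name content_diffs is made up, and it assumes the YXcstpoguax column layout described above.

```shell
# content_diffs  (hypothetical helper)
# Reads rsync --itemize-changes output on stdin and prints only the lines
# whose 3rd column is "c" (checksum differs) or whose 4th is "s" (size differs).
content_diffs() {
    awk 'substr($0, 3, 1) == "c" || substr($0, 4, 1) == "s"'
}
```

It would be used as rsync -niaHc /origfolder/ /copyfolder | content_diffs. Missing items (marked with +) are deliberately not matched here, so check the full output as well.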
I had the same question and used Anthony's answer, with a bit of a twist.
Applying his answer directly will fail if a hardware failure (such as an input/output error) forces diff to exit.
I combined his answer with this answer and put it all together into this:
find /path/to/original -type f -exec bash -c 'diff -rq --no-dereference "$@" "/path/to/destination/$(sed -r "s/^.*(<first-common-ancestor>.*)$/\1/g" <<<"$@")"' bash {} \;
- Replace /path/to/original with the path of the original directory you copied.
- Replace /path/to/destination with the path of the destination directory you copied to.
- Replace <first-common-ancestor> with the common ancestor directory between both. Example: you are copying from /media/foo/bar to /media/test/dst/, so that dst, after the copy operation is done, has the directory bar. The first common ancestor is bar here, because all files under bar will have the same relative path.
Some notes:
- The bash -c and bash {} parts pass each file name to the script as a positional parameter instead of pasting it into the command string. This keeps unusual file names from being interpreted as shell code, which could otherwise be abused (e.g. for privilege elevation).
- The sed part removes the absolute path of the file found and keeps only the relative path (this is different from using -execdir). If you are not sure how this is useful, try removing it and check the error messages :)
- The <<< (a here-string) makes sed read the variable as a string rather than as a path to a file to read.
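The same per-file idea can also be sketched without sed. This is a variation, not the command above: it swaps diff for cmp and uses shell parameter expansion to compute the relative path. The function name compare_trees is hypothetical, and it assumes the destination holds the contents of the source directly (so pass /media/test/dst/bar in the example above).

```shell
# compare_trees SRC DST  (hypothetical helper)
# Compares every regular file under SRC with the file at the same relative
# path under DST. A failed comparison is reported and the loop carries on,
# so a single I/O error does not abort the whole run.
compare_trees() {
    find "${1%/}" -type f -exec sh -c '
        src=$1 dst=$2; shift 2
        for file do
            rel=${file#"$src"/}   # strip the SRC prefix to get the relative path
            cmp -s -- "$file" "$dst/$rel" ||
                printf "DIFFERS, MISSING OR UNREADABLE: %s\n" "$rel"
        done
    ' sh "${1%/}" "${2%/}" {} +
}
```

As in the original command, file names are passed as arguments to the inner shell rather than embedded in the script text. No output means every file under SRC has an identical counterpart under DST; files that exist only in DST are not detected.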