Search and Delete duplicate files with different names
There is such a program, and it's called rdfind:
SYNOPSIS
rdfind [ options ] directory1 | file1 [ directory2 | file2 ] ...
DESCRIPTION
rdfind finds duplicate files across and/or within several directories.
It calculates checksums only if necessary. rdfind runs in O(N log(N))
time, with N being the number of files.
If two (or more) equal files are found, the program decides which of
them is the original and the rest are considered duplicates. This is
done by ranking the files to each other and deciding which has the
highest rank. See section RANKING for details.
It can delete the duplicates, or replace them with symbolic or hard links.
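For example, a cautious way to use it (a minimal sketch; the option names below are taken from rdfind's man page, so double-check them against your installed version) is to do a dry run first and only then let it delete:

rdfind -dryrun true dir1 dir2
rdfind -deleteduplicates true dir1 dir2

The first pass only reports what it would do (writing its findings to results.txt by default); the second actually removes the duplicates, keeping the highest-ranked file in each set.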
Hmmph. I just developed a one-liner to list all duplicates, for a question that turned out to be a duplicate of this. How meta. Well, shame to waste it, so I'll post it, though rdfind sounds like a better solution.
This at least has the advantage of being the "real" Unix way to do it ;)
find -name '*.mp3' -print0 | xargs -0 md5sum | sort | uniq -Dw 32
Breaking the pipeline down:
find -name '*.mp3' -print0 finds all mp3 files in the subtree starting at the current directory, printing the names NUL-separated.
xargs -0 md5sum reads the NUL-separated list and computes a checksum on each file.
You know what sort does.
uniq -Dw 32 compares the first 32 characters of the sorted lines and prints only the ones that have the same hash.
So you end up with a list of all duplicates. You can then whittle that down manually to the ones you want to delete, remove the hashes, and pipe the list to rm.
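For instance, if you have saved the pruned list to a file, something along these lines would do it (a rough sketch; it assumes a hypothetical file dupes.txt holding the md5sum-style lines you kept, GNU xargs, and filenames without embedded newlines):

cut -c 35- dupes.txt | xargs -d '\n' rm --

cut -c 35- drops the 32-character hash plus the two separator characters that md5sum prints before each name, leaving just the paths for rm.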
I'm glad you got the job done with rdfind.
Next time you could also consider rmlint. It's extremely fast and offers a few different options to help determine which file is the original in each set of duplicates.
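As a rough sketch of how a typical rmlint run looks (assuming recent rmlint defaults; the path ~/Music is just a placeholder):

rmlint ~/Music
sh rmlint.sh

The first command scans and, by default, writes a shell script (rmlint.sh) describing what it found; you review that script and then run it to actually act on the duplicates.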