Is it possible to detect duplicate image files?

As a sidenote, for images, I find raster data hashes to be far more effective than file hashes.

ImageMagick provides a reliable way to compute such hashes, and several Python bindings are available. Because the hash is computed from the decoded pixel data, it detects identical images that were saved with different lossless compression settings or different metadata.

Usage example:

>>> import PythonMagick
>>> img = PythonMagick.Image("image.png")
>>> img.signature()
'e11cfe58244d7cf98a79bfdc012857a9391249dca3aedfc0fde4528eed7f7ba7'

I wrote a script for this a while back. First it scans all files, noting their sizes in a dictionary. You end up with:

images[some_size] = ['x/a.jpg', 'b/f.jpg', 'n/q.jpg']
images[some_other_size] = ['q/b.jpg']
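
That first pass might look roughly like the sketch below. The function name `group_by_size` and the extension filter are illustrative assumptions, not part of the original script.

```python
import os
from collections import defaultdict

def group_by_size(root, extensions=(".jpg", ".jpeg", ".png", ".gif")):
    """Map each file size in bytes to the list of image paths with that size."""
    images = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Skip anything that does not look like an image (assumed filter).
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            images[os.path.getsize(path)].append(path)
    return images
```

Only the sizes that map to more than one path need any further checking.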

Then, for each key (file size) that has more than one entry in the dictionary, I'd read a fixed amount of each file and hash it. Something like:

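import hashlib
import os
from collections import defaultdict

# Only sizes shared by more than one file can contain duplicates.
possible_dupes = [size for size in images if len(images[size]) > 1]
for size in possible_dupes:
    hashes = defaultdict(list)
    for fname in images[size]:
        # Hash only the first 10000 bytes to keep the scan fast.
        with open(fname, 'rb') as f:
            digest = hashlib.md5(f.read(10000)).hexdigest()
        hashes[digest].append(fname)
    for names in hashes.values():
        # Keep the first file of each group and delete the rest.
        for fname in names[1:]:
            os.remove(fname)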

This is all off the top of my head, haven't tested the code, but you get the idea.
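
Because only the first 10000 bytes are hashed, two different files of equal size that share a long common prefix would be treated as duplicates. A more cautious variant, sketched here on the assumption that reading whole files is affordable, hashes the full contents in chunks and only reports groups instead of deleting anything. `hash_file` and `find_duplicates` are hypothetical helper names, and SHA-256 is swapped in for MD5.

```python
import hashlib
from collections import defaultdict

def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the whole file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    """Group paths by full-content digest; keep only groups with 2+ files."""
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[hash_file(path)].append(path)
    return [group for group in by_digest.values() if len(group) > 1]
```

Running `find_duplicates(images[size])` for each suspicious size gives lists you can review before removing anything.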


This assumes that by "same images" you mean files containing the same image data.

Compute the hash of the "no image" image and compare it to the hashes of the other images. If the hashes are the same, the files have the same contents.
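
A rough sketch of that comparison, assuming the placeholder image is available as a local file. The function name `matches_reference` is hypothetical, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib

def matches_reference(reference_path, candidate_paths):
    """Return the candidates whose bytes hash to the same digest as the reference."""
    def digest(path):
        # Reads the whole file at once; fine for typical image sizes.
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    reference = digest(reference_path)
    return [p for p in candidate_paths if digest(p) == reference]
```

Called with the placeholder's path and a list of downloaded files, it returns the files whose contents hash identically to the placeholder.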
