Image clustering by similarity in Python
The question is too broad to answer precisely.
Generally speaking, you can use any clustering algorithm, e.g. the popular k-means. To prepare your data for clustering, convert your collection into an array X in which every row is one example (image) and every column is a feature.
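As a minimal sketch of that layout, assuming scikit-learn is installed: the six two-feature rows below are made-up stand-ins for real image features, and n_init and random_state are set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 6 examples (rows) with 2 features (columns) each,
# forming two well-separated groups.
X = np.array([
    [0.0, 0.1],
    [0.2, 0.0],
    [0.1, 0.2],
    [5.0, 5.1],
    [5.2, 4.9],
    [4.9, 5.0],
])

# Fit k-means and read back one cluster label per row of X.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

print(X.shape)
print(labels)
```

Each entry of labels belongs to the row of X at the same index, so labels[i] is the cluster of image i.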
The main question is what your features should be. That is difficult to answer without knowing what you are trying to accomplish. If your images are small and all the same size, you can simply use every pixel as a feature. If you have metadata and would like to group by it, you can use every metadata tag as a feature.
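Both options can be sketched with NumPy alone. The helper names pixels_to_features and tags_to_features, the tiny 4x4 images, and the tags are all hypothetical and only there for illustration.

```python
import numpy as np

def pixels_to_features(images):
    """Stack same-sized images into X, one flattened image per row."""
    return np.stack([np.asarray(img, dtype=np.float32).ravel() for img in images])

def tags_to_features(tag_sets):
    """Build a 0/1 matrix with one column per distinct tag."""
    vocabulary = sorted(set().union(*tag_sets))
    X = np.zeros((len(tag_sets), len(vocabulary)), dtype=np.float32)
    for row, tags in enumerate(tag_sets):
        for tag in tags:
            X[row, vocabulary.index(tag)] = 1.0
    return X, vocabulary

# Hypothetical inputs: three 4x4 RGB images and their metadata tags.
images = [np.full((4, 4, 3), value, dtype=np.uint8) for value in (0, 128, 255)]
tag_sets = [{"cat", "indoor"}, {"dog"}, {"cat", "outdoor"}]

X_pixels = pixels_to_features(images)            # one column per pixel channel
X_tags, vocabulary = tags_to_features(tag_sets)  # one column per tag

print(X_pixels.shape, X_tags.shape, vocabulary)
```

Either matrix can be passed to a clustering algorithm as the array X.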
Now, if you really need to find patterns across images, you will have to apply an additional layer of processing, such as a convolutional neural network, which essentially extracts features from different regions of an image. You can think of it as a filter that converts every image into, say, an 8x8 matrix, which can then be flattened into a row of 64 features in your array X for clustering.
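To make the "image to 8x8 matrix to 64 features" idea concrete, here is a rough sketch that uses plain block averaging instead of a trained network. Block averaging is only a fixed, non-learned stand-in (a strided convolution with a uniform kernel); a real CNN learns its filters. The function name grid_features and the random 64x64 images are assumptions for illustration.

```python
import numpy as np

def grid_features(gray_image, grid=8):
    """Average a 2-D image over a grid x grid layout of blocks, then flatten."""
    height, width = gray_image.shape
    block_h, block_w = height // grid, width // grid
    # Crop so the image divides evenly into blocks.
    cropped = gray_image[:block_h * grid, :block_w * grid]
    # Split rows and columns into (block index, offset within block).
    blocks = cropped.reshape(grid, block_h, grid, block_w)
    # Averaging over the within-block offsets leaves a grid x grid matrix.
    return blocks.mean(axis=(1, 3)).ravel()

# Hypothetical data: five random 64x64 grayscale images.
rng = np.random.default_rng(0)
images = [rng.random((64, 64)) for _ in range(5)]

# Each image becomes one row of 8 * 8 = 64 features.
X = np.stack([grid_features(img) for img in images])
print(X.shape)
```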
I had the same problem and I came up with this solution:
- Import a pretrained model using Keras (here VGG16)
- Extract features per image
- Run k-means on the features
- Export the images by copying them with the cluster label in the file name
Here is my code, partly motivated by this post.
from keras.preprocessing import image
from keras.applications.vgg16 import VGG16
from keras.applications.vgg16 import preprocess_input
import numpy as np
from sklearn.cluster import KMeans
import os, shutil, glob, os.path
from PIL import ImageFile
# Let PIL load truncated image files instead of raising an error
ImageFile.LOAD_TRUNCATED_IMAGES = True
model = VGG16(weights='imagenet', include_top=False)
# Variables
imdir = 'C:/indir/'
targetdir = "C:/outdir/"
number_clusters = 3
# Loop over files and get features
filelist = glob.glob(os.path.join(imdir, '*.jpg'))
filelist.sort()
featurelist = []
for i, imagepath in enumerate(filelist):
    print(" Status: %s / %s" %(i, len(filelist)), end="\r")
    img = image.load_img(imagepath, target_size=(224, 224))
    img_data = image.img_to_array(img)
    img_data = np.expand_dims(img_data, axis=0)
    img_data = preprocess_input(img_data)
    features = np.array(model.predict(img_data))
    featurelist.append(features.flatten())
# Clustering
kmeans = KMeans(n_clusters=number_clusters, random_state=0).fit(np.array(featurelist))
# Copy images renamed by cluster
# Create the target dir if it does not exist yet
os.makedirs(targetdir, exist_ok=True)
# Copy with cluster name
print("\n")
for i, m in enumerate(kmeans.labels_):
    print(" Copy: %s / %s" %(i, len(kmeans.labels_)), end="\r")
    shutil.copy(filelist[i], os.path.join(targetdir, str(m) + "_" + str(i) + ".jpg"))
Update 02/2022:
In some cases (e.g. an unknown number of clusters), Affinity Propagation may be a much better choice than kmeans. In this case, replace kmeans with:
from sklearn.cluster import AffinityPropagation
affprop = AffinityPropagation(affinity="euclidean", damping=0.5).fit(np.array(featurelist))
and loop over affprop.labels_ to access the results.
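A small sketch of that loop, assuming scikit-learn 0.23 or newer (for the random_state argument, which is added here only for repeatability). The file names and feature vectors are synthetic stand-ins for filelist and featurelist from the script above.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import AffinityPropagation

# Hypothetical stand-ins for filelist and featurelist.
rng = np.random.default_rng(0)
filelist = ["img_%d.jpg" % i for i in range(12)]
featurelist = [rng.normal(loc=center, scale=0.1, size=4)
               for center in (0.0, 10.0, 20.0) for _ in range(4)]

affprop = AffinityPropagation(affinity="euclidean", damping=0.5,
                              random_state=0).fit(np.array(featurelist))

# Group file names by the label assigned to each feature row.
clusters = defaultdict(list)
for path, label in zip(filelist, affprop.labels_):
    clusters[int(label)].append(path)

for label, paths in sorted(clusters.items()):
    print(label, paths)
```

The number of clusters is chosen by the algorithm, so it is worth checking len(clusters) before copying files.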