Delete Duplicate Rows in Django DB

Here is a fast solution:

from django.db import connection

# One id from every group of rows that share the same unique_column value.
# MAX(id) keeps the query valid under standard SQL (selecting a bare, non-grouped
# "id" only works on permissive databases such as MySQL with ONLY_FULL_GROUP_BY off).
query = "SELECT MAX(id) FROM table_name GROUP BY unique_column HAVING COUNT(unique_column) > 1"

with connection.cursor() as cursor:
    cursor.execute(query)
    ids_list = [row[0] for row in cursor.fetchall()]

Now you can delete those rows:

SomeModel.objects.filter(id__in=ids_list).delete()

Or, if ids_list is too large for your DBMS to handle in a single IN clause, you can split it into chunks it can handle:

seg_length = 100
chunks = [ids_list[x:x + seg_length] for x in range(0, len(ids_list), seg_length)]
for chunk in chunks:
    SomeModel.objects.filter(id__in=chunk).delete()
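
For reference, the same duplicate ids can be collected without raw SQL by using the ORM's values()/annotate(Count()) aggregation. This is only a sketch, reusing the SomeModel model and unique_column field from above, and it keeps the lowest-id row in each duplicate group:

from django.db.models import Count

# Values of unique_column that occur on more than one row.
dupe_values = (
    SomeModel.objects.values('unique_column')
    .annotate(n=Count('id'))
    .filter(n__gt=1)
    .values_list('unique_column', flat=True)
)

ids_to_delete = []
for value in dupe_values:
    # Keep the first (lowest-id) row for each duplicated value, mark the rest.
    ids = list(
        SomeModel.objects.filter(unique_column=value)
        .order_by('id')
        .values_list('id', flat=True)
    )
    ids_to_delete.extend(ids[1:])

SomeModel.objects.filter(id__in=ids_to_delete).delete()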

This next approach may be faster, because it avoids running an inner filter query for each row in MyModel.

Since the ids are meant to be unique, we can sort the rows by photo_id in increasing order and keep track of the last id we saw. As we walk over the rows, any row whose photo_id matches the previous one must be a duplicate, so we can delete it.

last_seen_id = float('-inf')
rows = MyModel.objects.all().order_by('photo_id')

for row in rows:
    if row.photo_id == last_seen_id:
        row.delete()  # We've seen this photo_id in a previous row, so it's a duplicate.
    else:
        # New id found; remember it and check the following rows for duplicates.
        last_seen_id = row.photo_id
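
If deleting rows one at a time is too slow, the same walk can collect the duplicate primary keys and remove them with a single query at the end. A sketch, built on the same MyModel/photo_id example:

last_seen_id = float('-inf')
duplicate_pks = []

for row in MyModel.objects.all().order_by('photo_id'):
    if row.photo_id == last_seen_id:
        duplicate_pks.append(row.pk)  # Same photo_id as the previous row: a duplicate.
    else:
        last_seen_id = row.photo_id

# One DELETE ... WHERE id IN (...) instead of one query per duplicate row.
MyModel.objects.filter(pk__in=duplicate_pks).delete()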

The simplest way is the simplest way! Especially for one-off scripts where performance doesn't even matter (unless it does). Since it's not core code, I'd just write the first thing that comes to mind and works.

# Assuming which duplicate is removed doesn't matter...
for row in MyModel.objects.order_by('id').reverse():
    # More than one row still has this photo_id, so this row is a duplicate.
    if MyModel.objects.filter(photo_id=row.photo_id).count() > 1:
        row.delete()

The .reverse() makes the loop visit the rows from last to first, so the later duplicates get deleted and the first instance is kept rather than the last. (Note that .reverse() only has an effect on a queryset that already has an ordering, which is why the loop orders by id explicitly.)

As always, back up before you do this stuff.
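
Besides a backup, you can also wrap whichever loop you pick in a transaction, so an interrupted run doesn't leave the table half-deduplicated. A minimal sketch around the loop above:

from django.db import transaction

with transaction.atomic():
    for row in MyModel.objects.order_by('id').reverse():
        if MyModel.objects.filter(photo_id=row.photo_id).count() > 1:
            row.delete()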

Tags: Python, Django