Remove duplicates in a Django query
This query will not give you duplicates by itself: it returns every row in the table, ordered by email.
However, I presume what you mean is that you have duplicate data within your database. Adding distinct() here won't help, because even if your model has only one field of its own, it also has an automatic id field, so the combination of id+email is always unique.
Assuming you only need one field, email, de-duplicated, you can do this:
email_list = Email.objects.values_list('email', flat=True).distinct()
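To see why distinct() over whole rows does not collapse duplicates, here is a small sketch that uses Python's built-in sqlite3 module instead of Django. The table name, column names, and sample rows are made up for illustration, and the two SELECT statements only approximate what the ORM would send for a plain distinct() and for values_list('email', flat=True).distinct().

```python
import sqlite3

# In-memory database with a hypothetical "email" table (id + email columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE email (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO email (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("a@example.com",)],
)

# DISTINCT over (id, email): every row is still unique because id differs.
all_rows = conn.execute(
    "SELECT DISTINCT id, email FROM email ORDER BY email, id"
).fetchall()

# DISTINCT over the email column only: duplicates collapse.
unique_emails = [
    row[0]
    for row in conn.execute("SELECT DISTINCT email FROM email ORDER BY email")
]

print(len(all_rows))   # 3
print(unique_emails)   # ['a@example.com', 'b@example.com']
```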
However, you should really fix the root problem, and remove the duplicate data from your database.
For example, to delete duplicate Email rows by the email field:
for email in Email.objects.values_list('email', flat=True).distinct():
    # Keep the row with the lowest id for this email and delete the rest.
    Email.objects.filter(pk__in=Email.objects.filter(email=email).order_by('id').values_list('id', flat=True)[1:]).delete()
Or books by name:
for name in Book.objects.values_list('name', flat=True).distinct():
    # Keep the row with the lowest id for this name and delete the rest.
    Book.objects.filter(pk__in=Book.objects.filter(name=name).order_by('id').values_list('id', flat=True)[1:]).delete()
You can chain .distinct() on the end of your queryset to filter duplicates. Check out: http://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.distinct
To check for duplicates, you can do the equivalent of a GROUP BY and HAVING in Django, as below. We are using Django annotations here.
from django.db.models import Count
from app.models import Email
duplicate_emails = Email.objects.values('email').annotate(email_count=Count('email')).filter(email_count__gt=1)
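For reference, the annotate/filter chain corresponds roughly to a GROUP BY ... HAVING query. The sketch below runs comparable SQL through Python's built-in sqlite3 module; the table name and sample rows are assumptions for illustration, and the exact SQL Django generates may differ.

```python
import sqlite3

# In-memory database with a hypothetical "email" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE email (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO email (email) VALUES (?)",
    [
        ("a@example.com",),
        ("b@example.com",),
        ("a@example.com",),
        ("c@example.com",),
        ("c@example.com",),
        ("c@example.com",),
    ],
)

# Group rows by email and keep only the groups that occur more than once.
duplicate_rows = conn.execute(
    """
    SELECT email, COUNT(email) AS email_count
    FROM email
    GROUP BY email
    HAVING COUNT(email) > 1
    ORDER BY email
    """
).fetchall()

print(duplicate_rows)  # [('a@example.com', 2), ('c@example.com', 3)]
```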
Now loop through the above data and delete every email except the first one (adjust which row you keep to suit your requirements).
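for data in duplicate_emails:
    email = data['email']
    # Django cannot delete a sliced queryset, so find the row to keep and exclude it.
    keep_pk = Email.objects.filter(email=email).order_by('pk').values_list('pk', flat=True).first()
    Email.objects.filter(email=email).exclude(pk=keep_pk).delete()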
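If a per-email loop is too slow, the same "keep the lowest id, delete the rest" rule can be expressed as a single SQL statement. This is a different technique from the loop, sketched here with Python's built-in sqlite3 module on a made-up table. Treat it as an illustration of the idea, not as what the ORM does, and note that some databases restrict subqueries on the table being deleted from.

```python
import sqlite3

# In-memory database with a hypothetical "email" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE email (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO email (email) VALUES (?)",
    [
        ("a@example.com",),
        ("b@example.com",),
        ("a@example.com",),
        ("c@example.com",),
        ("c@example.com",),
    ],
)

# Keep the smallest id in each email group and delete every other row.
conn.execute(
    "DELETE FROM email "
    "WHERE id NOT IN (SELECT MIN(id) FROM email GROUP BY email)"
)

remaining = conn.execute("SELECT id, email FROM email ORDER BY id").fetchall()
print(remaining)
# [(1, 'a@example.com'), (2, 'b@example.com'), (4, 'c@example.com')]
```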