pymongo: remove duplicates (map reduce?)
First, I tried a very manual approach with lists and deleting afterwards, but the DB seems to be too big: it takes very long and is not practical.
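For illustration, a hypothetical sketch of the kind of data involved, assuming the duplicated key is a field called ID as in the answers below:

# Hypothetical sample documents: _id 1 and _id 2 share the same ID value,
# so one of them should be kept and the other removed.
db.collection.insert_many([
    {'_id': 1, 'ID': 'a', 'name': 'first'},
    {'_id': 2, 'ID': 'a', 'name': 'duplicate of first'},
    {'_id': 3, 'ID': 'b', 'name': 'second'},
])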
The best bet is using the .aggregate() method, which provides access to the aggregation pipeline to find the documents that are duplicates. The first stage in the pipeline is the $group stage, where you group your documents by the duplicated key and then use the $push and $sum accumulator operators, which respectively return an array of all _id values for each group and the count of elements in the group. The next and last stage in the pipeline is the $match stage, which returns only those results where the "ID" is duplicated. From there you iterate the cursor and remove each duplicate using "bulk" operations.
pipeline = [
    {'$group': {'_id': '$ID', 'count': {'$sum': 1}, 'ids': {'$push': '$_id'}}},
    {'$match': {'count': {'$gte': 2}}}
]

bulk = db.collection.initialize_ordered_bulk_op()
count = 0
for document in db.collection.aggregate(pipeline):
    it = iter(document['ids'])
    next(it)                          # keep the first document of each group
    for id in it:
        bulk.find({'_id': id}).remove_one()
        count += 1
        if count % 1000 == 0:
            bulk.execute()
            # a bulk can only be executed once, so start a new one
            bulk = db.collection.initialize_ordered_bulk_op()

if count % 1000 != 0:                 # execute any remaining operations
    bulk.execute()
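For reference, each document yielded by the pipeline above has roughly this shape (hypothetical values):

{'_id': 'a', 'count': 2, 'ids': [ObjectId('...'), ObjectId('...')]}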
MongoDB 3.2 deprecates Bulk() and its associated methods, so you will need to use the bulk_write() method to execute your requests.
from pymongo import DeleteOne

requests = []
for document in db.collection.aggregate(pipeline):
    it = iter(document['ids'])
    next(it)                          # keep the first document of each group
    for id in it:
        requests.append(DeleteOne({'_id': id}))

if requests:                          # avoid calling bulk_write() with an empty list
    db.collection.bulk_write(requests)
You can also do this in the shell as shown in the accepted answers to remove dups from mongodb and How to remove duplicates with a certain condition in mongodb?
My solution also uses aggregate. You pick the field which is duplicated for the aggregation. The result will be a list of groups of duplicates; every position holds one group of duplicates together. You iterate over the list, ignoring the first element of each group to preserve it, and delete the rest. You do this for every group of duplicates. See below:
replic = db.<YOUR_COLLECTION>.aggregate([  # cursor with all duplicated documents
    {'$group': {
        '_id': {'<FIELD_DUPLICATED>': '$<FIELD_DUPLICATED>'},  # duplicated field
        'idsUnicos': {'$addToSet': '$_id'},
        'total': {'$sum': 1}          # how many documents in each group, if you need it
    }},
    {'$match': {
        'total': {'$gt': 1}           # keep only groups with more than one document
    }}
])

# Each result document holds the ObjectIds of one group of duplicates
for i in replic:
    for idx, j in enumerate(i['idsUnicos']):   # ids of all duplicates in the group
        if idx != 0:                           # skip the first element to keep it
            db.<YOUR_COLLECTION>.delete_one({'_id': j})  # remove the rest
You can try "delete_many" to improve performance.
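For instance, a minimal sketch of that idea, assuming you re-run the aggregation above so the replic cursor is fresh (same <YOUR_COLLECTION> placeholder as above):

for group in replic:
    surplus_ids = group['idsUnicos'][1:]   # everything except the first _id
    if surplus_ids:
        # one round trip per group instead of one delete_one() per duplicate
        db.<YOUR_COLLECTION>.delete_many({'_id': {'$in': surplus_ids}})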
An alternative approach is to use the aggregation framework, which has better performance than map-reduce. Consider the following aggregation pipeline: in the first stage, the $group operator groups documents by the ID field and, using the $addToSet operator, stores each _id value of the grouped records in the unique_ids field. The $sum accumulator operator adds up the values passed to it, in this case the constant 1, thereby counting the number of grouped records into the count field. The other pipeline step, $match, filters documents with a count of at least 2, i.e. duplicates.
Once you get the result from the aggregation, you iterate the cursor, remove the first _id from the unique_ids field, and push the rest into an array that will later be used to remove the duplicates (all but one entry per group):
cursor = db.coll.aggregate(
    [
        {"$group": {"_id": "$ID", "unique_ids": {"$addToSet": "$_id"}, "count": {"$sum": 1}}},
        {"$match": {"count": {"$gte": 2}}}
    ]
)

response = []
for doc in cursor:
    del doc["unique_ids"][0]          # keep the first _id of each group
    for id in doc["unique_ids"]:
        response.append(id)

db.coll.remove({"_id": {"$in": response}})
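Note that Collection.remove() is deprecated in PyMongo 3.x; a minimal equivalent using the newer API, with the same response list, would be:

db.coll.delete_many({"_id": {"$in": response}})   # same effect as the remove() call above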