Find all duplicate documents in a MongoDB collection by a key field
Note: this solution is the easiest to understand, but not the best.
You can use mapReduce
to find out how many times a document contains a certain field:
var map = function(){
if(this.name) {
emit(this.name, 1);
}
}
var reduce = function(key, values){
return Array.sum(values);
}
var res = db.collection.mapReduce(map, reduce, {out:{ inline : 1}});
db[res.result].find({value: {$gt: 1}}).sort({value: -1});
For a generic Mongo solution, see the MongoDB cookbook recipe for finding duplicates using group
. Note that aggregation is faster and more powerful in that it can return the _id
s of the duplicate records.
For pymongo, the accepted answer (using mapReduce) is not that efficient. Instead, we can use the group method:
$connection = 'mongodb://localhost:27017';
$con = new Mongo($connection); // mongo db connection
$db = $con->test; // database
$collection = $db->prb; // table
$keys = array("name" => 1); Select name field, group by it
// set intial values
$initial = array("count" => 0);
// JavaScript function to perform
$reduce = "function (obj, prev) { prev.count++; }";
$g = $collection->group($keys, $initial, $reduce);
echo "<pre>";
print_r($g);
Output will be this :
Array
(
[retval] => Array
(
[0] => Array
(
[name] =>
[count] => 1
)
[1] => Array
(
[name] => MongoDB
[count] => 2
)
)
[count] => 3
[keys] => 2
[ok] => 1
)
The equivalent SQL query would be: SELECT name, COUNT(name) FROM prb GROUP BY name
. Note that we still need to filter out elements with a count of 0 from the array. Again, refer to the MongoDB cookbook recipe for finding duplicates using group
for the canonical solution using group
.
The accepted answer is terribly slow on large collections, and doesn't return the _id
s of the duplicate records.
Aggregation is much faster and can return the _id
s:
db.collection.aggregate([
{ $group: {
_id: { name: "$name" }, // replace `name` here twice
uniqueIds: { $addToSet: "$_id" },
count: { $sum: 1 }
} },
{ $match: {
count: { $gte: 2 }
} },
{ $sort : { count : -1} },
{ $limit : 10 }
]);
In the first stage of the aggregation pipeline, the $group
operator aggregates documents by the name
field and stores in uniqueIds
each _id
value of the grouped records.
The $sum operator adds up the values of the fields passed to it, in this case the constant 1
- thereby counting the number of grouped records into the count
field.
In the second stage of the pipeline, we use $match
to filter documents with a count
of at least 2, i.e. duplicates.
Then, we sort the most frequent duplicates first, and limit the results to the top 10.
This query will output up to $limit
records with duplicate names, along with their _id
s. For example:
{
"_id" : {
"name" : "Toothpick"
},
"uniqueIds" : [
"xzuzJd2qatfJCSvkN",
"9bpewBsKbrGBQexv4",
"fi3Gscg9M64BQdArv",
],
"count" : 3
},
{
"_id" : {
"name" : "Broom"
},
"uniqueIds" : [
"3vwny3YEj2qBsmmhA",
"gJeWGcuX6Wk69oFYD"
],
"count" : 2
}