Find all duplicate documents in a MongoDB collection by a key field

Note: this solution is the easiest to understand, but not the best.

You can use mapReduce to find out how many times a document contains a certain field:

var map = function(){
   if(this.name) {
        emit(this.name, 1);
   }
}

var reduce = function(key, values){
    return Array.sum(values);
}

var res = db.collection.mapReduce(map, reduce, {out:{ inline : 1}});
db[res.result].find({value: {$gt: 1}}).sort({value: -1});

For a generic Mongo solution, see the MongoDB cookbook recipe for finding duplicates using group. Note that aggregation is faster and more powerful in that it can return the _ids of the duplicate records.

For pymongo, the accepted answer (using mapReduce) is not that efficient. Instead, we can use the group method:

$connection = 'mongodb://localhost:27017';
$con        = new Mongo($connection); // mongo db connection

$db         = $con->test; // database 
$collection = $db->prb; // table

$keys       = array("name" => 1); Select name field, group by it

// set intial values
$initial    = array("count" => 0);

// JavaScript function to perform
$reduce     = "function (obj, prev) { prev.count++; }";

$g          = $collection->group($keys, $initial, $reduce);

echo "<pre>";
print_r($g);

Output will be this :

Array
(
    [retval] => Array
        (
            [0] => Array
                (
                    [name] => 
                    [count] => 1
                )

            [1] => Array
                (
                    [name] => MongoDB
                    [count] => 2
                )

        )

    [count] => 3
    [keys] => 2
    [ok] => 1
)

The equivalent SQL query would be: SELECT name, COUNT(name) FROM prb GROUP BY name. Note that we still need to filter out elements with a count of 0 from the array. Again, refer to the MongoDB cookbook recipe for finding duplicates using group for the canonical solution using group.

The accepted answer is terribly slow on large collections, and doesn't return the _ids of the duplicate records.

Aggregation is much faster and can return the _ids:

db.collection.aggregate([
  { $group: {
    _id: { name: "$name" },   // replace `name` here twice
    uniqueIds: { $addToSet: "$_id" },
    count: { $sum: 1 } 
  } }, 
  { $match: { 
    count: { $gte: 2 } 
  } },
  { $sort : { count : -1} },
  { $limit : 10 }
]);

In the first stage of the aggregation pipeline, the $group operator aggregates documents by the name field and stores in uniqueIds each _id value of the grouped records. The $sum operator adds up the values of the fields passed to it, in this case the constant 1 - thereby counting the number of grouped records into the count field.

In the second stage of the pipeline, we use $match to filter documents with a count of at least 2, i.e. duplicates.

Then, we sort the most frequent duplicates first, and limit the results to the top 10.

This query will output up to $limit records with duplicate names, along with their _ids. For example:

{
  "_id" : {
    "name" : "Toothpick"
},
  "uniqueIds" : [
    "xzuzJd2qatfJCSvkN",
    "9bpewBsKbrGBQexv4",
    "fi3Gscg9M64BQdArv",
  ],
  "count" : 3
},
{
  "_id" : {
    "name" : "Broom"
  },
  "uniqueIds" : [
    "3vwny3YEj2qBsmmhA",
    "gJeWGcuX6Wk69oFYD"
  ],
  "count" : 2
}

Find all duplicate documents in a MongoDB collection by a key field

Tags:

Duplicates

Mapreduce

Mongodb

Aggregation Framework

Related

Recent Posts