Why is PyMongo count_documents is slower than count?
As already mentioned here, the behavior is not specific to PyMongo.
The reason is because the count_documents
method in PyMongo performs an aggregation query and does not use any metadata. see collection.py#L1670-L1688
pipeline = [{'$match': filter}]
if 'skip' in kwargs:
pipeline.append({'$skip': kwargs.pop('skip')})
if 'limit' in kwargs:
pipeline.append({'$limit': kwargs.pop('limit')})
pipeline.append({'$group': {'_id': None, 'n': {'$sum': 1}}})
cmd = SON([('aggregate', self.__name),
('pipeline', pipeline),
('cursor', {})])
if "hint" in kwargs and not isinstance(kwargs["hint"], string_type):
kwargs["hint"] = helpers._index_document(kwargs["hint"])
collation = validate_collation_or_none(kwargs.pop('collation', None))
cmd.update(kwargs)
with self._socket_for_reads(session) as (sock_info, slave_ok):
result = self._aggregate_one_result(
sock_info, slave_ok, cmd, collation, session)
if not result:
return 0
return result['n']
This command has the same behavior as the collection.countDocuments
method.
That being said, if you willing to trade accuracy for performance, you can use the estimated_document_count
method which on the other hand, send a count
command to the database with the same behavior as collection.estimatedDocumentCount
See collection.py#L1609-L1614
if 'session' in kwargs:
raise ConfigurationError(
'estimated_document_count does not support sessions')
cmd = SON([('count', self.__name)])
cmd.update(kwargs)
return self._count(cmd)
Where self._count
is a helper sending the command.
This is not about PyMongo but Mongo itself.
count
is a native Mongo function. It doesn't really count all the documents. Whenever you insert or delete a record in Mongo, it caches the total number of records in the collection. Then when you run count
, Mongo will return that cached value.
count_documents
uses a query object, which means that it has to loop through all the records in order to get the total count. Because you're not passing any parameters, it will have to run over all 60 million records. This is why it is slow.
based on @Stennie comment
You can use estimated_document_count() in PyMongo 3.7+ to return the fast count based on collection metadata. The original count() was deprecated because the behaviour differed (estimated vs actual count) based on whether query criteria was provided. The newer driver API is more intentional about the outcome