Idempotent counter implementation with Firebase Firestore

The answer depends on several aspects of how you are using the collection, as well as what "perfectly count" means to you.

Preamble

To start with, since Cloud Function invocations are asynchronous with respect to the write, the counter will lag slightly behind the true count of the collection. I'm assuming this is okay.

Even if you counted the collection by reading every document, the count could still be stale as documents might have been inserted or deleted while you were counting.

Costs

You mention "without being too expensive". Here we need to understand how often you read the count versus how often you add or remove documents. To maintain a separate counter, you'll be reading and writing it for every document count change. Since writes cost roughly 3x as much as reads, you'd need to read the stored count 4 or more times per document to recover the cost of keeping it. There's a formula in here somewhere that takes into account the average number of counts over the lifetime of a document, but I'll leave that as an exercise for the reader.
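The break-even above can be sketched as a rough cost model. The 3x write/read price ratio is taken from the paragraph above; the function names are hypothetical, and the model ignores free quotas and network egress:

```javascript
// Rough cost model in "read-units" (assumption: one write costs ~3 reads).
const WRITE_COST = 3;

// Maintaining a counter costs one read + one write of the counter document
// per document create/delete.
function counterCostPerDoc() {
  return 1 + WRITE_COST; // = 4 read-units
}

// Re-reading the collection instead costs one read of this document
// for every time the collection is counted.
function directCountCostPerDoc(timesCounted) {
  return timesCounted;
}

// The counter pays off once a document would have been counted 4+ times.
```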

Idempotent counters

This is an interesting problem, and a familiar one in distributed systems. If a client requests to increment a counter by +1 and the request times out (the server never responds) - is it safe to request again? What if the server did apply the increment but then experienced a network issue? What if it didn't?

Below I'll outline some methods to deal with this situation.

Idempotent counters - Transaction Ids

One way to deal with this is to send a unique transaction id (txid) with the increment request. If the server has already processed that txid, it knows the request is a duplicate and can respond that the increment has already been applied.

In your use case, if you never delete documents, you could use the document id as the txid. When you increment the counter, add the document id to an array of processed increments. Before doing so, check that it doesn't already exist in the array (which would indicate it has already been processed).

One obvious problem with the above is that the array will keep growing and eventually get too big. So we'll want to limit how long we keep track of old ids. You could use a timestamp and remove everything older than 'X', or you could simply treat the array as a circular buffer to cap it at a fixed maximum size.
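The circular-buffer variant can be sketched as below. This is a hypothetical in-memory model; in practice `counterDoc` would be the counter document, read and written inside a Firestore transaction so the check-then-update is atomic:

```javascript
// Dedupe increments using a fixed-size ring of recently seen txids.
const MAX_IDS = 100; // how much history to keep (tune to your write rate)

function applyIncrement(counterDoc, txid) {
  if (counterDoc.recentIds.includes(txid)) {
    return counterDoc; // duplicate delivery: already counted, do nothing
  }
  counterDoc.count += 1;
  counterDoc.recentIds.push(txid);
  if (counterDoc.recentIds.length > MAX_IDS) {
    counterDoc.recentIds.shift(); // drop the oldest id (circular buffer)
  }
  return counterDoc;
}
```

Note the trade-off: once an id has been evicted from the ring, a very late duplicate delivery would be counted again, which is exactly the limitation the write-rate discussion below is about.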

Both these approaches are reasonable for slow write rates, but wouldn't suffice for faster ones. For example, at 1000 writes/second, that's 5000 document ids just to cover 5 seconds (and we mention in our limits documentation that a Function might take more than 5 seconds to execute).

Enter Forgetful Bloom Filters

Idempotent counters - Forgetful Bloom Filters

This method gives you much higher write rate support in exchange for a very small possibility of thinking you've seen a document id before.

I won't detail the implementation here, but there is an excellent overview of it in this blog post: Counters, Idempotence And Forgetful Bloom Filters
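To give a feel for the building block, here is a minimal plain Bloom filter sketch. This is not the full "forgetful" variant (which rotates several of these filters to age out old entries - see the blog post); the sizes and the hash function are illustrative, not production-quality:

```javascript
// Minimal Bloom filter: set membership with false positives but no
// false negatives, in constant space.
class BloomFilter {
  constructor(bits = 1024, hashes = 3) {
    this.bits = new Uint8Array(bits);
    this.hashes = hashes;
  }
  // Simple seeded string hash (illustrative only).
  hash(value, seed) {
    let h = seed;
    for (let i = 0; i < value.length; i++) {
      h = (h * 31 + value.charCodeAt(i)) >>> 0;
    }
    return h % this.bits.length;
  }
  add(value) {
    for (let s = 0; s < this.hashes; s++) {
      this.bits[this.hash(value, s + 1)] = 1;
    }
  }
  mightContain(value) {
    for (let s = 0; s < this.hashes; s++) {
      if (!this.bits[this.hash(value, s + 1)]) return false; // definitely new
    }
    return true; // probably seen before (small false-positive chance)
  }
}
```

A "yes" from `mightContain` means you might skip a legitimate increment (the "very small possibility" mentioned above); a "no" is always safe to count.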

Idempotent counters - Deleting

An additional complexity is handling deletions. If you use unique ids and are sure they won't be reused (e.g. our native auto-id support), this isn't too hard to add. Just repeat what you did for additions in a separate list/field, and make sure you check both lists.

One minor thing to think about is Cloud Functions doesn't have guaranteed execution order. This means you might see a delete before an insert if they happen close enough together.

My recommendation: if you see a delete before an insert, decrement the counter anyway, knowing it will be trued up soon; and if you see an insert after a delete, still do the increment. This is because you only keep so much history, so you cannot tell whether the insert and delete arrived out of order, or whether the delete simply came long after the insert.
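The rule above can be sketched as follows. This is a hypothetical in-memory model (unbounded lists, no eviction); the real version would live in the counter document, be updated in a transaction, and trim history as discussed earlier:

```javascript
// Track recent creates and deletes separately, so out-of-order
// Cloud Function deliveries still converge on the right count.
function onEvent(counter, type, docId) {
  if (type === 'create') {
    if (counter.added.includes(docId)) return counter; // duplicate create
    counter.added.push(docId);
    counter.count += 1; // if the delete arrived first, this trues it up
  } else { // 'delete'
    if (counter.deleted.includes(docId)) return counter; // duplicate delete
    counter.deleted.push(docId);
    counter.count -= 1; // may briefly run ahead of the create
  }
  return counter;
}
```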

Other methods

Depending on the collection size, how accurate the count needs to be, and how often the count is used, you could periodically call a Cloud Function to compute the count and store it in a document. You could dynamically scale the schedule based on the size of the collection to minimize the lag: for really small collections, recount frequently; for larger collections, do it less frequently.

You can apply cost optimizations here as well if you have a mechanism to determine documents you've already counted (so you only need to count the new ones). If deletions are infrequent, you could add an event to decrement the counter on deletions.
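One way to express the "scale with collection size" idea is a simple scheduling heuristic. The thresholds and intervals below are made-up illustrations, not recommendations:

```javascript
// Hypothetical heuristic: recount small collections often (cheap to
// re-read), large ones rarely (expensive, but the relative error of a
// slightly stale count is smaller).
function recountIntervalMinutes(collectionSize) {
  if (collectionSize < 1000) return 5;      // every few minutes
  if (collectionSize < 100000) return 60;   // hourly
  return 24 * 60;                           // daily
}
```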


Right now, due to the lack of guarantees around the Cloud Firestore + Cloud Functions integration, the only way to be 100% sure your count is accurate is to read the entire collection each time you write the count.

As you said, that's not terribly efficient (in terms of speed or cost).

If you want to try and keep the count as each write comes in without repeatedly reading the entire collection, consider adding a counted boolean to each document.

Then when a document comes in, you do the following in a transaction:

  1. Read the document. If counted == true, exit
  2. Increment the count.
  3. Mark counted as true.
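The three steps above can be modeled in memory like this. The function name is a hypothetical stand-in; in the real version, steps 1-3 run inside `db.runTransaction()` using the transaction's get/update methods so the read and the two writes are atomic:

```javascript
// One transaction's worth of work: count a document at most once.
function countInTransaction(doc, counter) {
  // 1. Read the document; if it was already counted, exit.
  if (doc.counted === true) return { doc, counter };
  // 2. Increment the count.
  counter.count += 1;
  // 3. Mark the document as counted, so retries and duplicate
  //    Function invocations become no-ops.
  doc.counted = true;
  return { doc, counter };
}
```

Because the transaction retries on contention, running it twice for the same document (e.g. a duplicate Function invocation) still only increments the counter once.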

For more information on transactions in Cloud Firestore, see the docs: https://firebase.google.com/docs/firestore/manage-data/transactions