How to handle MongoDB "schema" changes in production
Extending @Michael Korbakov's answer, I implemented his steps as a mongo shell script (see the MongoDB Reference Manual on mongo shell scripts).
Important: as stated in the MongoDB Reference Manual, running a script in the mongo shell can help performance because it reduces connection latency for each batch fetch and bulk execution. A downside to consider is that mongo shell commands are always synchronous, but bulk execution already takes care of parallelism (for each chunk) for us, so we're good for this use case.
Code:
// constants
var sourceDbName = 'sourceDb';
var sourceCollectionName = 'sourceColl';
var destDbName = 'destDb';
var destCollectionName = 'destColl';
var bulkWriteChunkSize = 1000;
// for fetching, I figured 1000 for the current bulkWrite, and +1000 ready for the next bulkWrite
var batchSize = 2000;

var sourceDb = db.getSiblingDB(sourceDbName);
var destDb = db.getSiblingDB(destDbName);

var start = new Date();
var cursor = sourceDb[sourceCollectionName].find({}).noCursorTimeout().batchSize(batchSize);

var currChunkSize = 0;
var bulk = destDb[destCollectionName].initializeUnorderedBulkOp();
cursor.forEach(function(doc) {
    currChunkSize++;
    bulk.insert({
        ...doc,
        newProperty: 'hello!',
    }); // can be changed to fit your needs, e.g. if you want an update instead
    if (currChunkSize === bulkWriteChunkSize) {
        bulk.execute();
        // each bulk.execute() took ~130ms for me, so I figured to wait the same time as well
        sleep(130);
        currChunkSize = 0;
        bulk = destDb[destCollectionName].initializeUnorderedBulkOp();
    }
});
// flush the last, partial chunk
if (currChunkSize > 0) {
    bulk.execute();
    currChunkSize = 0;
}
var end = new Date();
print(end - start);
cursor.close();
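For reference, a script like this can be run non-interactively; assuming it is saved as migrate.js (a hypothetical filename), you can pass it to the shell along with a connection string:

mongo mongodb://localhost:27017 migrate.js

or load it from an already-open shell session:

load('migrate.js')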
One of the significant advantages of schema-less databases is that you don't have to update the entire database with new schema layouts. If some of the documents in the DB don't have particular information, then your code can do the appropriate thing instead, or elect to not do anything with that record.
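For instance, a minimal sketch of that defensive read in the mongo shell (the collection name, field name, and default value are hypothetical):

var doc = db.myColl.findOne({ _id: someId });
// the field may be missing on documents written before the schema change
var newProperty = (doc && doc.newProperty !== undefined) ? doc.newProperty : 'default';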
Another option is to lazily update the documents as required - only when they are looked at again. In this instance, you might elect to have a per-record/document version flag - which initially may not even appear (and thus signifies 'version 0'). Even that is optional, though. Instead, your database access code looks for the data it requires, and if it does not exist because it is new information added after a code update, it fills in the results to the best of its ability.
For your example, converting an _id:false into a standard MongoId field: when the document is read (or written back after an update) and _id:false is currently set, make the change, and write it only when it is absolutely required.
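A minimal sketch of that lazy, on-read migration using a version flag (the collection name, version numbers, and the upgrade itself are hypothetical):

function loadAndUpgrade(query) {
    var doc = db.myColl.findOne(query);
    if (doc === null) return null;
    // a missing schemaVersion field signifies 'version 0'
    var version = doc.schemaVersion || 0;
    if (version < 1) {
        // hypothetical upgrade to version 1: add the new field and stamp the version,
        // writing back only because this document was touched anyway
        doc.newProperty = 'default';
        doc.schemaVersion = 1;
        db.myColl.replaceOne({ _id: doc._id }, doc);
    }
    return doc;
}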
You indeed have to write a script that will go over the collection and add the new field to each document. However, exactly how you do it depends on the size of your DB and the performance of your storage system. Adding a field to a document changes its size and thus causes relocation in most cases. This operation has an impact on IO and is also bounded by it. If your collection is just a few thousand documents, maybe up to a hundred thousand, then you may just iterate over it in one loop, because the whole collection probably fits into memory and all IO will happen afterwards. However, if the collection spans far beyond available memory, then the approach is more complicated. We usually follow these steps in production use of MongoDB:
- Open a cursor with timeout=False
- Read a chunk of documents into memory
- Run update queries on these documents
- Sleep for some time to avoid overloading IO subsystem and hurting production application
- Repeat until done
- Close the cursor :)
The size of the document chunks and the sleeping period must be determined experimentally. Usually, you want to avoid queued reads/writes (the QR/QW columns in mongostat) for the duration of the migration. For larger collections on slower drives (like EBS on Amazon), this IO-safe approach can take from hours to days.
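A rough mongo shell sketch of these steps, updating documents in place (the collection name, field name, chunk size, and sleep interval are placeholders to tune against mongostat):

var chunkSize = 500;   // tune experimentally
var sleepMs = 200;     // tune experimentally
// only fetch _id for documents that still lack the new field
var cursor = db.myColl.find({ newProperty: { $exists: false } }, { _id: 1 }).noCursorTimeout();
var ids = [];
cursor.forEach(function(doc) {
    ids.push(doc._id);
    if (ids.length === chunkSize) {
        db.myColl.updateMany({ _id: { $in: ids } }, { $set: { newProperty: 'default' } });
        sleep(sleepMs);  // let the IO subsystem recover before the next chunk
        ids = [];
    }
});
if (ids.length > 0) {
    // update the final, partial chunk
    db.myColl.updateMany({ _id: { $in: ids } }, { $set: { newProperty: 'default' } });
}
cursor.close();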