Ways to implement data versioning in MongoDB
The first big question when diving in to this is "how do you want to store changesets"?
- Diffs?
- Whole record copies?
My personal approach would be to store diffs. Because the display of these diffs is really a special action, I would put the diffs in a different "history" collection.
I would use the different collection to save memory space. You generally don't want a full history for a simple query. So by keeping the history out of the object you can also keep it out of the commonly accessed memory when that data is queried.
To make my life easy, I would make a history document contain a dictionary of time-stamped diffs. Something like this:
{
_id : "id of address book record",
changes : {
1234567 : { "city" : "Omaha", "state" : "Nebraska" },
1234568 : { "city" : "Kansas City", "state" : "Missouri" }
}
}
To make my life really easy, I would make this part of my DataObjects (EntityWrapper, whatever) that I use to access my data. Generally these objects have some form of history, so that you can easily override the save()
method to make this change at the same time.
UPDATE: 2015-10
It looks like there is now a spec for handling JSON diffs. This seems like a more robust way to store the diffs / changes.
If you're looking for a ready-to-roll solution -
Mongoid has built in simple versioning
http://mongoid.org/en/mongoid/docs/extras.html#versioning
mongoid-history is a Ruby plugin that provides a significantly more complicated solution with auditing, undo and redo
https://github.com/aq1018/mongoid-history
There is a versioning scheme called "Vermongo" which addresses some aspects which haven't been dealt with in the other replies.
One of these issues is concurrent updates, another one is deleting documents.
Vermongo stores complete document copies in a shadow collection. For some use cases this might cause too much overhead, but I think it also simplifies many things.
https://github.com/thiloplanz/v7files/wiki/Vermongo
Here's another solution using a single document for the current version and all old versions:
{
_id: ObjectId("..."),
data: [
{ vid: 1, content: "foo" },
{ vid: 2, content: "bar" }
]
}
data
contains all versions. The data
array is ordered, new versions will only get $push
ed to the end of the array. data.vid
is the version id, which is an incrementing number.
Get the most recent version:
find(
{ "_id":ObjectId("...") },
{ "data":{ $slice:-1 } }
)
Get a specific version by vid
:
find(
{ "_id":ObjectId("...") },
{ "data":{ $elemMatch:{ "vid":1 } } }
)
Return only specified fields:
find(
{ "_id":ObjectId("...") },
{ "data":{ $elemMatch:{ "vid":1 } }, "data.content":1 }
)
Insert new version: (and prevent concurrent insert/update)
update(
{
"_id":ObjectId("..."),
$and:[
{ "data.vid":{ $not:{ $gt:2 } } },
{ "data.vid":2 }
]
},
{ $push:{ "data":{ "vid":3, "content":"baz" } } }
)
2
is the vid
of the current most recent version and 3
is the new version getting inserted. Because you need the most recent version's vid
, it's easy to do get the next version's vid
: nextVID = oldVID + 1
.
The $and
condition will ensure, that 2
is the latest vid
.
This way there's no need for a unique index, but the application logic has to take care of incrementing the vid
on insert.
Remove a specific version:
update(
{ "_id":ObjectId("...") },
{ $pull:{ "data":{ "vid":2 } } }
)
That's it!
(remember the 16MB per document limit)