How can I implement versioning without replacing with previous record in DynamoDB?

I've been experimenting and calculating what's most efficient in terms of read/write units and cost, considering race conditions where updates occur while a version is being logged, and avoiding of duplication of data. I've narrowed down a couple of possible solutions. You'd have to consider your best variation.

The basic concepts revolve around consider version 0 as the latest version. Also, we will use a revisions key that will list how many revisions exist before this item, but also will be used to determine the current version of the item (version = revisions + 1). Being able to calculate how versions exists is a requirement and, in my opinion, revisions fulfills that need as well as a value that can be presented to the user.

So the first row will be created with version: 0 and revisions: 0. While this is technically the first version (v1), we don't apply a version number until it's archived. When this row changes, version stays at 0, which still denotes latest, and revisions is incremented to 1. A new row is created with all the previous values, with the exception that now that row denotes version: 1.

To summarize:

On item creation:

  • Create item with revisions: 0 and version 0

On item update or overwrite:

  • Increment revisions
  • Insert old row exactly as before, but change version: 0 to the new version which can easily be calculated as version: revisions + 1.

Here's what converting to convertion would look like on a table with just primary key:

Primary Key: id

  id  color
9501  violet
9502  cyan
9503  magenta

Primary Key: id+version

id    version  revisions  color
9501        0          6  violet
9501        1          0  red
9501        2          1  orange
9501        3          2  yellow
9501        4          3  green
9501        5          4  blue
9501        6          5  indigo

Here's converting a table that already uses a sort key:

Primary key: id+date

id    date     color
9501  2018-01  violet
9501  2018-02  cyan
9501  2018-03  black

Primary key: id+date_ver

id    date_ver     revisions  color
9501  2018-01__v0          6  violet
9501  2018-01__v1          0  red
9501  2018-01__v2          1  orange
9501  2018-01__v3          2  yellow
9501  2018-01__v4          3  green
9501  2018-01__v5          4  blue
9501  2018-01__v6          5  indigo

Alternative #2:

id    date_ver     revisions  color
9501  2018-01              6  violet
9501  2018-01__v1          0  red
9501  2018-01__v2          1  orange
9501  2018-01__v3          2  yellow
9501  2018-01__v4          3  green
9501  2018-01__v5          4  blue
9501  2018-01__v6          5  indigo

We actually have the option of either putting previous versions in the same table or separate them into their own table. Both options have their different advantages and disadvantages.

Using the same table:

  • Primary key is composed of Partition key and Sort key
  • Version must be used in sort key either alone as a number or appended to an existing sort key as a string

Advantages:

Disadvantages:

  • Possibly limits your use of the table sort keys
  • Versioning uses the same writes units as your primary table
  • Sort keys can only be configured during table creation
  • Possibly need to readjust your code to query against v0
  • Previous versions will also be affected by indexes

Using a secondary tables:

  • Add revision key to both tables
  • If not using sort key, build a sort key for secondary table called version. The primary table would always have version: 0. Use of this key on the primary table isn't mandatory.
  • If already using a sort key, see "Alternative #2" above

Advantages:

  • Primary table does not need to change any keys or be recreated. get requests do not change.
  • Primary table keeps its sort key
  • Secondary table can have independent read and write capacity units
  • Secondary table has its own indexes

Disadvantages:

  • Requires managing of a second table

Regardless of how you decide to partition the data, now we have to decide how create the revision rows. Here are a couple of different methods:

On-demand, synchronous item-overwrite/update and revision-insert

Summary: Get the current version of the row. Perform both an update on the current row and insert the previous version with one transaction.

To avoid race conditions, we need to write both the update and insert in the same operation using TransactWriteItems. Also, we need to make sure that the version we are updating is the right version by the time the request reaches database server. We achieve this by either one of two checks, or even both:

  1. In the Update command in TransactItems, the ConditionExpression must check that the revision in the row to be updated matches the revision in the object we performed a Get on before.
  2. In the Put command in TransactItems, the ConditionExpression checks to make sure the row does not already exist.

Cost

  • 1 Read capacity unit per 4K for Get on v0
  • 1 Write capacity unit for preparing TransactWriteItem
  • 1 Write capacity unit per 1K for Put/Update on v0
  • 1 Write capacity unit per 1K for Put on revision
  • 1 Write capacity unit for committing TransactWriteItem

Notes:

  • Items are limited to 400KB

On-demand, asynchronous item-get, item-overwrite/update, and revision-insert

Summary: Get and store current row. When overwriting or updating a row, check against current revision and increment revisions. Insert previously stored row with version number.

Perform an update with

{
  UpdateExpression: 'SET revisions = :newRevisionCount',
  ExpressionAttributeValues: {
    ':newRevisionCount': previousRow.revisions + 1,
    ':expectedRevisionCount': previousRow.revisions,
  },
  ConditionExpression: 'revisions = :expectedRevisionCount',
}

We can use the same ConditionExpression with put when overwriting a previously existing row.

In the response, we are watching for ConditionalCheckFailedException. If this is returned, that means the revision was already changed by another process and we need to repeat the process from the beginning or abort entirely. If there are no exceptions, then we can insert the previous stored row after update the value on your version attribute as appropriate (numeric or string).

Cost

  • 1 read capacity unit per 4K for Get on v0
  • 1 write capacity unit per 1KB for Put/UpdateItem on v0
  • 1 write capacity unit per 1KB for Put on revision

On-demand, asynchronous blind item-update and revision-insert

Summary: Perform a "blind" update on the v0 row while incrementing revisions and requesting the old attributes. Use the return value to create a new row with the version number.

Perform an update-item with

{
  UpdateExpression: 'ADD revisions :revisionIncrement',
  ExpressionAttributeValues: {
    ':revisionIncrement': 1,
  },
  ReturnValues: 'ALL_OLD',
}

The ADD action will automatically create revisions if it doesn't exist and will consider it 0. One nice benefit of ReturnValues is:

There is no additional cost associated with requesting a return value aside from the small network and processing overhead of receiving a larger response. No read capacity units are consumed.

In the update response, the Attributes value will be the data from the old record. The version of this record is the value of Attributes.revisions + 1. Update the value on your version attribute as appropriate (numeric or string).

Now you can insert this record into your target table.

Cost

  • 1 write capacity unit per 1KB for Update on v0
  • 1 write capacity unit per 1KB for Put on revision

Notes:

  • The returned object's Attributes length is limited to 65535.
  • No solution for overwriting rows.

Automated asynchronous revision-insert

Summary: Perform "blind" updates and inserts on the primary while incrementing revisions. Use a Lambda trigger watching for changes to revision to insert revisions asynchronously.

Perform an update with

{
  UpdateExpression: 'ADD revisions :revisionIncrement',
  ExpressionAttributeValues: {
    ':revisionIncrement': 1,
  },
}

The ADD action will automatically create revisions if it doesn't exist and will consider it 0.

For overwriting records with put increment revisions value based on a previous get request.

Configure a DynamoDB Stream view type to return both new and old images. Setup a Lambda trigger against the database table. Here's a piece of sample code for NodeJS that would compare the old and new images and call a function to write the revisions in batch.

/**
 * @param {AWSLambda.DynamoDBStreamEvent} event
 * @return {void}
 */
export function handler(event) {
  const oldRevisions = event.Records
    .filter(record => record.dynamodb.OldImage
      && record.dynamodb.NewImage
      && record.dynamodb.OldImage.revision.N !== record.dynamodb.NewImage.revision.N)
    .map(record => record.dynamodb.OldImage);
  batchWriteRevisions(oldRevisions);
}

This is just sample, but production code would likely include more checks.

Cost

  • 1 read capacity unit per 4K for get on v0 (only when overwriting)
  • 1 write capacity unit per 1KB for Put/Update on v0
  • 1 DynamoDB Stream read request unit per GetRecords command
  • 1 write capacity unit per 1KB for Put of revision

Notes:

  • DynamoDB Stream shard data expires after 24 hours
  • DynamoDB Stream read request units are independent of table read capacity units
  • Use of Lambda functions has its own pricing
  • Changing stream view type requires disabling and re-enabling the stream
  • Works with Write, Put, BatchWriteItems, TransactWriteItems commands

For my use cases, I'm already using a DynamoDB Streams and I don't expect users to request versioned rows all that often. I also can let users wait a bit for the revisions to be ready since it's asynchronous. That makes using a second table and the automated lambda process the more ideal solution for me.

For the asynchronous options there are some points of failure. Still, it's something that you can either retry immediately on the on-demand requests, or schedule for later for the DynamoDB Stream solution.

If anybody has any other solutions or critiques, please comment. Thanks!


I don't think the DynamoDB service currently support row versioning natively. If you want the versioning functionality you will need to do it on your side.

In DynamoDB a row is uniquely identified by its primary key. The primary key could be either HashKey-only or HashKey+RangeKey. If you want to differentiate the same row with different versions, you need to include the version number somewhere in your primary key.

For example you can append the version number to the end of your hashkey for all the old versions of a row. The row with the latest version will use the original hashkey.

Hash    Attr   Version
hey      a2     2
hey_v1   a1     1

after update the row to version 3 the table should look like this:

Hash    Attr   Version
hey      a3      3
hey_v1   a1      1
hey_v2   a2      2

Doing versioning on the client side is always not perfect. for example, for the above approach, if you do a scan you will get hey_V1 and hey_v2 also. please let me know if this works for you or not. If you have better way to do versioning on client side please also post here.


You can also achieve this by maintaining two separate tables. One for just the latest items, and another for their versions. I wrote a blogpost with a detailed explanation https://www.efekarakus.com/2018/05/25/client-side-row-versioning-in-dynamo-db.html

The resource table, where hash is the primary key.

      +----------+---------+-------------------+
      |   hash   | version |   attr1..attrN    |
      +----------+---------+-------------------+
      | 1c5815b2 |    2    |  some values      |
      +----------+---------+-------------------+

The resource-history table, where hash is the partition key and version the sort key.

      +----------+---------+-------------------+
      |   hash   | version |   attr1..attrN    |
      +----------+---------+-------------------+
      | 1c5815b2 |    2    |  some values      |
      +----------+---------+-------------------+
      | 1c5815b2 |    1    |  some old values  |
      +----------+---------+-------------------+

The important part is that any action that changes the record should increment its version number.

When you create or update the resource, write first to the resource-history table and then to the resource table.

I found this to be slightly cleaner, because you won't run into potential data loss scenarios like you could while working on immutable data on a single table.