How to use Data Pipeline to export a DynamoDB table that uses on-demand capacity mode
Support for on-demand tables was added to the DDB export tool earlier this year: GitHub commit
I was able to put a newer build of the tool on S3 and update a few things in the pipeline to get it working:
{
"objects": [
{
"output": {
"ref": "S3BackupLocation"
},
"input": {
"ref": "DDBSourceTable"
},
"maximumRetries": "2",
"name": "TableBackupActivity",
"step": "s3://<your-tools-bucket>/emr-dynamodb-tools-4.11.0-SNAPSHOT.jar,org.apache.hadoop.dynamodb.tools.DynamoDBExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}",
"id": "TableBackupActivity",
"runsOn": {
"ref": "EmrClusterForBackup"
},
"type": "EmrActivity",
"resizeClusterBeforeRunning": "true"
},
{
"failureAndRerunMode": "CASCADE",
"resourceRole": "DataPipelineDefaultResourceRole",
"role": "DataPipelineDefaultRole",
"pipelineLogUri": "s3://<your-log-bucket>/",
"scheduleType": "ONDEMAND",
"name": "Default",
"id": "Default"
},
{
"readThroughputPercent": "#{myDDBReadThroughputRatio}",
"name": "DDBSourceTable",
"id": "DDBSourceTable",
"type": "DynamoDBDataNode",
"tableName": "#{myDDBTableName}"
},
{
"directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
"name": "S3BackupLocation",
"id": "S3BackupLocation",
"type": "S3DataNode"
},
{
"name": "EmrClusterForBackup",
"coreInstanceCount": "1",
"coreInstanceType": "m3.xlarge",
"releaseLabel": "emr-5.26.0",
"masterInstanceType": "m3.xlarge",
"id": "EmrClusterForBackup",
"region": "#{myDDBRegion}",
"type": "EmrCluster",
"terminateAfter": "1 Hour"
}
],
"parameters": [
{
"description": "Output S3 folder",
"id": "myOutputS3Loc",
"type": "AWS::S3::ObjectKey"
},
{
"description": "Source DynamoDB table name",
"id": "myDDBTableName",
"type": "String"
},
{
"default": "0.25",
"watermark": "Enter value between 0.1-1.0",
"description": "DynamoDB read throughput ratio",
"id": "myDDBReadThroughputRatio",
"type": "Double"
},
{
"default": "us-east-1",
"watermark": "us-east-1",
"description": "Region of the DynamoDB table",
"id": "myDDBRegion",
"type": "String"
}
],
"values": {
"myDDBRegion": "us-west-2",
"myDDBTableName": "<your table name>",
"myDDBReadThroughputRatio": "0.5",
"myOutputS3Loc": "s3://<your-output-bucket>/"
}
}
Key changes:
- Update the releaseLabel of EmrClusterForBackup to "emr-5.26.0". This is needed to get v1.11 of the AWS SDK for Java and v4.11.0 of the DynamoDB connector (see the release matrix here: AWS docs).
- Update the step of TableBackupActivity as shown above: point it to your build of the *.jar (see the build sketch below), and update the class name of the tool from DynamoDbExport to DynamoDBExport.
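For reference, the export tool lives in the awslabs/emr-dynamodb-connector repository on GitHub. Building and uploading it yourself could look roughly like this (a sketch; it assumes Maven and the AWS CLI are installed, and the exact jar version depends on the branch you build):

$ git clone https://github.com/awslabs/emr-dynamodb-connector.git
$ cd emr-dynamodb-connector
$ mvn clean package -DskipTests
$ aws s3 cp emr-dynamodb-tools/target/emr-dynamodb-tools-4.11.0-SNAPSHOT.jar s3://<your-tools-bucket>/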
Hopefully the default template gets updated as well so it just works out of the box.
I opened a support ticket with AWS on this. Their response was pretty comprehensive, so I will paste it below:
Thanks for reaching out regarding this issue.
Unfortunately, Data Pipeline export/import jobs for DynamoDB do not support DynamoDB's new On-Demand mode [1].
Tables using On-Demand capacity do not have defined capacities for Read and Write units. Data Pipeline relies on this defined capacity when calculating the throughput of the pipeline.
For example, if you have 100 RCU (Read Capacity Units) and a pipeline throughput of 0.25 (25%), the effective pipeline throughput would be 25 read units per second (100 * 0.25). However, in the case of On-Demand capacity, the RCU and WCU (Write Capacity Units) are reflected as 0. Regardless of the pipeline throughput value, the calculated effective throughput is 0.
The pipeline will not execute when the effective throughput is less than 1.
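You can verify this for yourself: for an On-Demand table, describe-table reports the provisioned throughput as zero (assuming a table named myTable; output abridged):

$ aws dynamodb describe-table --table-name myTable --query 'Table.ProvisionedThroughput'
{
    "NumberOfDecreasesToday": 0,
    "ReadCapacityUnits": 0,
    "WriteCapacityUnits": 0
}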
Are you required to export DynamoDB tables to S3?
If you are using these table exports for backup purposes only, I recommend using DynamoDB's On-Demand Backup and Restore feature (a confusingly similar name to On-Demand capacity) [2].
Note that On-Demand Backups do not impact the throughput of your table, and are completed in seconds. You only pay for the S3 storage costs associated with the backups. However, these table backups are not directly accessible to customers, and can only be restored to a new DynamoDB table. This method of backup is not suitable if you wish to perform analytics on the backup data, or import the data into other systems, accounts or tables.
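Creating such a backup is a single CLI call, for example (the table and backup names here are placeholders):

$ aws dynamodb create-backup --table-name myTable --backup-name myTable-backup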
If you need to use Data Pipeline to export DynamoDB data, then the only way forward is to set the table(s) to Provisioned capacity mode.
You could do this manually, or include it as an activity in the pipeline itself, using an AWS CLI command [3].
For example (On-Demand is also referred to as Pay Per Request mode):
$ aws dynamodb update-table --table-name myTable --billing-mode PROVISIONED --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100
And to switch the table back to On-Demand mode afterwards:
$ aws dynamodb update-table --table-name myTable --billing-mode PAY_PER_REQUEST
Note that after disabling On-Demand capacity mode, you need to wait for 24 hours before you can enable it again.
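As a sketch, the mode switch could be folded into the pipeline definition as a ShellCommandActivity that the export activity declares in its dependsOn (the name and id below are hypothetical; and since update-table returns while the table is still UPDATING, the command also waits for the table to become ACTIVE before the export starts):

{
  "name": "SetProvisionedCapacity",
  "id": "SetProvisionedCapacity",
  "type": "ShellCommandActivity",
  "runsOn": { "ref": "EmrClusterForBackup" },
  "command": "aws dynamodb update-table --table-name #{myDDBTableName} --billing-mode PROVISIONED --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100 && aws dynamodb wait table-exists --table-name #{myDDBTableName}"
}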
=== Reference Links ===
[1] DynamoDB On-Demand capacity (also refer to the note on unsupported services/tools): https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html#HowItWorks.OnDemand
[2] DynamoDB On-Demand Backup and Restore: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BackupRestore.html
[3] AWS CLI reference for DynamoDB "update-table": https://docs.aws.amazon.com/cli/latest/reference/dynamodb/update-table.html