Auto Scale Fargate Service Based On SQS ApproximateNumberOfMessagesVisible
Yes, you can do this. You have to use a step scaling policy, and you need an existing CloudWatch alarm on your SQS queue depth (ApproximateNumberOfMessagesVisible).
Go to CloudWatch and create a new alarm. We'll call this alarm sqs-queue-depth-high and have it trigger when the approximate number of messages visible reaches 1000.
With that done, go to ECS and open the service you want to autoscale. Click Update for the service. Add a scaling policy and choose the Step Scaling variety. You'll see an option to create a new alarm (which only lets you choose between CPUUtilization and MemoryUtilization) or to use an existing alarm.
Type sqs-queue-depth-high in the "Use existing alarm" field and press Enter. You should see a green checkmark that confirms the name is valid (i.e. the alarm exists). New dropdowns then appear where you can adjust the step policy.
This works for any metric alarm and any ECS service. If you plan to roll this setup out more widely, for multiple environments for example, or to make it more sophisticated than 2 steps, do yourself a favor and manage it with CloudFormation or Terraform. Nothing is worse than having to adjust a 5-step alarm across 10 services by hand.
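To make the step policy concrete, here is a minimal sketch of how step adjustments are picked once the alarm fires. The step table and the function name are made up for illustration; the only behavior it models is that step bounds are offsets from the alarm threshold, with the lower bound inclusive and the upper bound exclusive.

```javascript
// Hypothetical step table: bounds are offsets from the alarm threshold,
// lower bound inclusive, upper bound exclusive (null = unbounded).
const steps = [
  { lower: 0, upper: 4000, change: 1 },    // 1000 <= depth < 5000: add 1 task
  { lower: 4000, upper: null, change: 3 }  // depth >= 5000: add 3 tasks
];

// Return the task-count change for a metric value, or 0 if the alarm
// threshold has not been breached.
function stepAdjustment(metricValue, threshold, stepTable) {
  const offset = metricValue - threshold;
  const step = stepTable.find(
    (s) => offset >= s.lower && (s.upper === null || offset < s.upper)
  );
  return step ? step.change : 0;
}

console.log(stepAdjustment(500, 1000, steps));  // 0
console.log(stepAdjustment(1200, 1000, steps)); // 1
console.log(stepAdjustment(7500, 1000, steps)); // 3
```

With the sqs-queue-depth-high threshold of 1000, a queue depth of 1200 falls in the first step and a depth of 7500 in the second.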
AWS documents a solution for scaling based on an SQS queue: https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-using-sqs-queue.html
Main idea
- Create a CloudWatch custom metric sqs-backlog-per-task using the formula: sqs-backlog-per-task = sqs-messages-number / running-task-number.
- Create a Target Tracking Scaling Policy based on the backlogPerInstance metric (published here as sqs-backlog-per-task).
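As a rough worked example of the two numbers involved, the sketch below computes the backlog per task and a candidate target value. The function names and the sample figures are my own assumptions; the target calculation (acceptable latency divided by average processing time per message) follows the approach in the AWS guide linked above.

```javascript
// Backlog per task: queued messages divided by running tasks.
// With no running tasks, the whole queue counts as the backlog.
function backlogPerTask(visibleMessages, runningTasks) {
  return runningTasks > 0
    ? Math.floor(visibleMessages / runningTasks)
    : visibleMessages;
}

// Candidate target value: how many messages one task may have queued
// and still finish them within the acceptable latency.
function acceptableBacklogPerTask(acceptableLatencyMs, processingMsPerMessage) {
  return Math.floor(acceptableLatencyMs / processingMsPerMessage);
}

console.log(backlogPerTask(1500, 10));             // 150
console.log(backlogPerTask(1500, 0));              // 1500
console.log(acceptableBacklogPerTask(10000, 100)); // 100
```

In this example each task has 150 messages queued against a target of 100, so the policy would be expected to scale out.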
Implementation details
Custom Metric
In my case all the infrastructure (Fargate, SQS, and other resources) is described in a CloudFormation stack. To calculate and log the custom metric I decided to use an AWS Lambda function, which is also described in the CloudFormation stack and deployed together with the rest of the infrastructure.
Below you can find code snippets for the AWS Lambda function, which logs the following custom metrics:
- sqs-backlog-per-task: used for scaling
- running-task-number: used for scaling optimization and debugging
AWS Lambda function described in AWS SAM syntax in CloudFormation stack (infrastructure.yml):
CustomMetricLoggerFunction:
  Type: AWS::Serverless::Function
  Properties:
    FunctionName: custom-metric-logger
    Handler: custom-metric-logger.handler
    Runtime: nodejs8.10
    MemorySize: 128
    Timeout: 3
    Role: !GetAtt CustomMetricLoggerFunctionRole.Arn
    Environment:
      Variables:
        ECS_CLUSTER_NAME: !Ref Cluster
        ECS_SERVICE_NAME: !GetAtt Service.Name
        SQS_URL: !Ref Queue
    Events:
      Schedule:
        Type: Schedule
        Properties:
          Schedule: 'cron(0/1 * * * ? *)' # every minute
AWS Lambda JavaScript code for calculating and logging (custom-metric-logger.js):
var AWS = require('aws-sdk');

// Reads the queue depth and running task count, then publishes both custom metrics.
exports.handler = async () => {
  try {
    var sqsMessagesNumber = await getSqsMessagesNumber();
    var runningContainersNumber = await getRunningContainersNumber();

    // With no running tasks, report the whole queue as the backlog.
    var backlogPerInstance = sqsMessagesNumber;
    if (runningContainersNumber > 0) {
      backlogPerInstance = Math.floor(sqsMessagesNumber / runningContainersNumber);
    }

    await putRunningTaskNumberMetricData(runningContainersNumber);
    await putSqsBacklogPerTaskMetricData(backlogPerInstance);

    return {
      statusCode: 200
    };
  } catch (err) {
    console.log(err);
    return {
      statusCode: 500
    };
  }
};

// Resolves with the approximate number of visible messages in the queue.
function getSqsMessagesNumber() {
  return new Promise((resolve, reject) => {
    var params = {
      QueueUrl: process.env.SQS_URL,
      AttributeNames: ['ApproximateNumberOfMessages']
    };
    var sqs = new AWS.SQS();
    sqs.getQueueAttributes(params, (err, data) => {
      if (err) {
        reject(err);
      } else {
        // Attribute values come back as strings.
        resolve(parseInt(data.Attributes.ApproximateNumberOfMessages, 10));
      }
    });
  });
}

// Resolves with the number of tasks currently running in the ECS service.
function getRunningContainersNumber() {
  return new Promise((resolve, reject) => {
    var params = {
      services: [
        process.env.ECS_SERVICE_NAME
      ],
      cluster: process.env.ECS_CLUSTER_NAME
    };
    var ecs = new AWS.ECS();
    ecs.describeServices(params, (err, data) => {
      if (err) {
        reject(err);
      } else {
        resolve(data.services[0].runningCount);
      }
    });
  });
}

// Publishes the running task count as the running-task-number metric.
function putRunningTaskNumberMetricData(value) {
  return new Promise((resolve, reject) => {
    var params = {
      MetricData: [{
        MetricName: 'running-task-number',
        Value: value,
        Unit: 'Count',
        Timestamp: new Date()
      }],
      Namespace: 'fargate-sqs-service'
    };
    var cloudwatch = new AWS.CloudWatch();
    cloudwatch.putMetricData(params, (err, data) => {
      if (err) {
        reject(err);
      } else {
        resolve(data);
      }
    });
  });
}

// Publishes the backlog per task as the sqs-backlog-per-task metric.
function putSqsBacklogPerTaskMetricData(value) {
  return new Promise((resolve, reject) => {
    var params = {
      MetricData: [{
        MetricName: 'sqs-backlog-per-task',
        Value: value,
        Unit: 'Count',
        Timestamp: new Date()
      }],
      Namespace: 'fargate-sqs-service'
    };
    var cloudwatch = new AWS.CloudWatch();
    cloudwatch.putMetricData(params, (err, data) => {
      if (err) {
        reject(err);
      } else {
        resolve(data);
      }
    });
  });
}
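A possible simplification, not part of my original function: since MetricData is an array, both metrics could go out in a single putMetricData call. The helper below (buildMetricParams is a hypothetical name) only builds the request object, so it can be tried without AWS credentials.

```javascript
// Build one PutMetricData request carrying both custom metrics.
// The namespace and metric names mirror the ones used in the Lambda above.
function buildMetricParams(runningTasks, backlogPerTask, now = new Date()) {
  const datum = (name, value) => ({
    MetricName: name,
    Value: value,
    Unit: 'Count',
    Timestamp: now
  });
  return {
    Namespace: 'fargate-sqs-service',
    MetricData: [
      datum('running-task-number', runningTasks),
      datum('sqs-backlog-per-task', backlogPerTask)
    ]
  };
}

const params = buildMetricParams(4, 250);
console.log(params.MetricData.map((d) => d.MetricName));

// Inside the Lambda (aws-sdk v2) this could be sent with:
//   await new AWS.CloudWatch().putMetricData(params).promise();
```

Sharing one timestamp between the two data points also keeps them aligned when you graph them together.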
Target Tracking Scaling Policy
Then, based on the sqs-backlog-per-task metric, I created a Target Tracking Scaling Policy in my CloudFormation template.
Target Tracking Scaling Policy based on the sqs-backlog-per-task metric (infrastructure.yml):
ServiceScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: service-scaling-policy
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref ServiceScalableTarget
    TargetTrackingScalingPolicyConfiguration:
      ScaleInCooldown: 60
      ScaleOutCooldown: 60
      CustomizedMetricSpecification:
        Namespace: fargate-sqs-service
        MetricName: sqs-backlog-per-task
        Statistic: Average
        Unit: Count
      TargetValue: 2000
As a result AWS Application Auto Scaling creates and manages the CloudWatch alarms that trigger the scaling policy and calculates the scaling adjustment based on the metric and the target value. The scaling policy adds or removes capacity as required to keep the metric at, or close to, the specified target value. In addition to keeping the metric close to the target value, a target tracking scaling policy also adjusts to changes in the metric due to a changing load pattern.
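As a mental model only, you can think of target tracking as scaling the task count in proportion to how far the metric is from the target. AWS does not publish the exact algorithm, so the sketch below is an approximation under that assumption; estimateDesiredTasks and the min/max limits are hypothetical.

```javascript
// Rough proportional estimate of the task count that would bring the
// metric back to the target, clamped to the scalable target's limits.
// This is an approximation, not the actual Application Auto Scaling logic.
function estimateDesiredTasks(currentTasks, metricValue, targetValue, minTasks, maxTasks) {
  const proportional = Math.ceil(currentTasks * (metricValue / targetValue));
  return Math.min(maxTasks, Math.max(minTasks, proportional));
}

console.log(estimateDesiredTasks(2, 5000, 2000, 1, 10));  // 5
console.log(estimateDesiredTasks(4, 1000, 2000, 1, 10));  // 2
console.log(estimateDesiredTasks(4, 90000, 2000, 1, 10)); // 10
```

With TargetValue: 2000, two tasks each facing a backlog of 5000 would need roughly five tasks to get back to the target.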
I wrote a blog article about exactly this topic, including a Docker container to run it. The article can be found at: https://allaboutaws.com/how-to-auto-scale-aws-ecs-containers-sqs-queue-metrics
The prebuilt container is available on Docker Hub: https://hub.docker.com/r/sh39sxn/ecs-autoscaling-sqs-metrics
The files are available on GitHub: https://github.com/sh39sxn/ecs-autoscaling-sqs-metrics
I hope it helps you.
Update for 2021 (maybe earlier)
For those who need this in CDK, here is an example use case:
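// Imports assume CDK v1 module names; `stack` is an existing Stack defined elsewhere.
import * as ec2 from '@aws-cdk/aws-ec2';
import * as ecs from '@aws-cdk/aws-ecs';
import * as sqs from '@aws-cdk/aws-sqs';
import { QueueProcessingFargateService } from '@aws-cdk/aws-ecs-patterns';

// Create the VPC and cluster used by the queue processing service
const vpc = new ec2.Vpc(stack, 'Vpc', { maxAzs: 2 });
const cluster = new ecs.Cluster(stack, 'FargateCluster', { vpc });
const queue = new sqs.Queue(stack, 'ProcessingQueue', {
  queueName: 'FargateEventQueue' // CDK props are camelCase
});

// Create the queue processing service
new QueueProcessingFargateService(stack, 'QueueProcessingFargateService', {
  cluster,
  image: ecs.ContainerImage.fromRegistry('amazon/amazon-ecs-sample'),
  desiredTaskCount: 2, // later CDK releases use minScalingCapacity instead
  maxScalingCapacity: 5,
  queue
});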
from:
https://github.com/aws/aws-cdk/blob/master/design/aws-ecs/aws-ecs-autoscaling-queue-worker.md