AWS: SQS trigger to Lambda is automatically disabled when the Lambda fails
I'm not sure why the Lambda stops being invoked; my suspicion is that the Lambda service notices the function keeps failing and temporarily suspends the trigger.
You can try a number of workarounds:
- Use DynamoDB on-demand capacity; AWS says it scales instantly.
- Alternatively, if you use provisioned capacity and get a ProvisionedThroughputExceededException, don't abort the Lambda execution; instead, re-insert the message into the SQS queue and exit successfully. That way the Lambda service won't see any failures, and no SQS messages will be lost either.
Something along these lines could help :)
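For example, a rough Python/boto3 sketch of that idea (the queue URL, table name, and item layout are placeholders, not details from the original setup):

```python
import json
import os

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
sqs = boto3.client("sqs")

QUEUE_URL = os.environ["QUEUE_URL"]    # placeholder: URL of the source SQS queue
TABLE_NAME = os.environ["TABLE_NAME"]  # placeholder: DynamoDB table name


def handler(event, context):
    for record in event["Records"]:
        body = json.loads(record["body"])
        try:
            # Placeholder item layout; use whatever your table actually stores.
            dynamodb.put_item(
                TableName=TABLE_NAME,
                Item={"pk": {"S": body["id"]}, "payload": {"S": record["body"]}},
            )
        except ClientError as err:
            if err.response["Error"]["Code"] == "ProvisionedThroughputExceededException":
                # Put the message back with a delay and carry on, so the
                # Lambda service never sees a failed invocation.
                sqs.send_message(
                    QueueUrl=QUEUE_URL,
                    MessageBody=record["body"],
                    DelaySeconds=60,
                )
            else:
                raise
    return {"status": "ok"}
```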
AWS support says the trigger can be disabled because of insufficient permissions for the Lambda execution role.
My question:
Where are the conditions under which the Lambda trigger can be automatically disabled documented? And where can I find out why a trigger was disabled (some kind of Lambda service logs)?
AWS support answer:
Currently, there is no such public documentation which mentions the possible reasons for the Lambda trigger being disabled automatically. However, as I mentioned earlier, the most probable reason for the SQS Lambda trigger being disabled is that the Lambda function execution role does not have one or more of the following required permissions:
- sqs:ChangeMessageVisibility
- sqs:DeleteMessage
- sqs:GetQueueAttributes
- sqs:ReceiveMessage
- Access to relevant KMS keys
- Any applicable cross account permissions
- Also, if the Lambda function is in a VPC, it must have all the permissions to list, create, and delete ENIs
Also, the reason for the trigger being disabled will not be mentioned in the Lambda function logs. So, I request you to please make sure that the Lambda function execution role has all the required permissions. If the Lambda function execution role has all the required permissions, the SQS trigger should not get disabled automatically.
In my case we actually missed the VPC permissions, i.e. we hadn't attached the AWSLambdaVPCAccessExecutionRole policy to the Lambda execution role. (I have no idea how the Lambda worked at all without this policy.) Five days have passed since we fixed the role and no trigger has been disabled, so this works.
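For reference, attaching that managed policy to the execution role can be done with boto3 roughly like this (the role name is a placeholder):

```python
import boto3

iam = boto3.client("iam")

# Attach the AWS-managed VPC access policy to the Lambda execution role.
# "my-lambda-execution-role" is a placeholder for the actual role name.
iam.attach_role_policy(
    RoleName="my-lambda-execution-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole",
)
```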
As for DynamoDB and "backpressure", MLu's idea is correct.
If you have only one write to DynamoDB per SQS message, you can simply let the Lambda fail when the write fails. The message stays in SQS and will be delivered to the Lambda again after the visibility timeout. In this case it's better to use a batch size of 1, so the messages are processed one by one.
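For example, a minimal sketch of the single-write case, with a placeholder table name and item layout:

```python
import json
import os

import boto3

dynamodb = boto3.client("dynamodb")
TABLE_NAME = os.environ["TABLE_NAME"]  # placeholder: DynamoDB table name


def handler(event, context):
    # With a batch size of 1 there is exactly one record per invocation.
    record = event["Records"][0]
    body = json.loads(record["body"])

    # No try/except: if this throws ProvisionedThroughputExceededException,
    # the invocation fails, the message is not deleted from SQS, and it
    # becomes visible again after the visibility timeout.
    dynamodb.put_item(
        TableName=TABLE_NAME,
        Item={"pk": {"S": body["id"]}, "payload": {"S": record["body"]}},
    )
```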
If you have multiple writes to DynamoDB per SQS message (write multiplication), it's better to catch ProvisionedThroughputExceededException in the Lambda and put the failed writes into another queue with a delay, so that another Lambda can repeat them later. Note that it's important to repeat each individual failed write, not the original message.
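A sketch of the multi-write case, assuming each message fans out into several items and that a separate delayed retry queue exists (queue, table, and item layout are all placeholders):

```python
import json
import os

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
sqs = boto3.client("sqs")

TABLE_NAME = os.environ["TABLE_NAME"]            # placeholder: DynamoDB table name
RETRY_QUEUE_URL = os.environ["RETRY_QUEUE_URL"]  # placeholder: queue for failed writes


def handler(event, context):
    for record in event["Records"]:
        # Hypothetical fan-out: one SQS message becomes several DynamoDB items.
        items = json.loads(record["body"])["items"]
        for item in items:
            try:
                dynamodb.put_item(
                    TableName=TABLE_NAME,
                    Item={"pk": {"S": item["id"]}, "payload": {"S": json.dumps(item)}},
                )
            except ClientError as err:
                if err.response["Error"]["Code"] == "ProvisionedThroughputExceededException":
                    # Re-queue only the single failed write, not the whole
                    # message, so already-written items are not repeated.
                    sqs.send_message(
                        QueueUrl=RETRY_QUEUE_URL,
                        MessageBody=json.dumps(item),
                        DelaySeconds=300,
                    )
                else:
                    raise
    return {"status": "ok"}
```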
The dataflow will be like this: SQS queue → Lambda → DynamoDB writes; any failed write → delayed retry queue → retry Lambda → DynamoDB.
Note that delayed repeating of the writes is acceptable only if you really can delay and repeat them: the writes should be idempotent and should not carry real-time data. Otherwise, it may be better to silently ignore the exceptions, so that the Lambda doesn't fail and the message is simply removed from SQS and forgotten.
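If you do go the delayed-retry route, one way to keep the repeated writes idempotent is to derive the item key deterministically from the write's content, so a retried put_item overwrites the same item instead of creating a duplicate (the key scheme here is just an illustration):

```python
import hashlib
import json

import boto3

dynamodb = boto3.client("dynamodb")


def idempotent_put(table_name, item_dict):
    # Derive a deterministic key from the item content, so retrying the same
    # write overwrites the same DynamoDB item rather than adding a duplicate.
    key = hashlib.sha256(json.dumps(item_dict, sort_keys=True).encode()).hexdigest()
    dynamodb.put_item(
        TableName=table_name,
        Item={"pk": {"S": key}, "payload": {"S": json.dumps(item_dict)}},
    )
```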