stackdriver-metadata-agent-cluster-level gets OOMKilled
I was about to open a support ticket with GCP, but they have this notice:
Description: We are experiencing an issue with Fluentd crashlooping in Google Kubernetes Engine where the master version is 1.14 or 1.15, when gVisor is enabled. The fix is targeted for a release aiming to begin on 17 April 2020. We will provide more updates as the date gets closer. We will provide an update by Thursday, 2020-04-09 14:30 US/Pacific with current details. We apologize to all who are affected by the disruption.
Start time: April 2, 2020 at 10:58:24 AM GMT-7
End time:
Steps to reproduce: Fluentd crashloops in GKE clusters could lead to missing logs.
Workaround: Upgrade Google Kubernetes Engine cluster masters to version 1.16+.
Affected products: Other
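If upgrading the master is an option for you, a minimal sketch of that workaround (the cluster name, zone, and target version are placeholders; pick an available 1.16 version for your zone):
gcloud container get-server-config --zone us-central1-a
# Upgrade the cluster master to a 1.16 release (substitute your cluster name, zone, and version)
gcloud container clusters upgrade my-cluster \
  --zone us-central1-a \
  --master \
  --cluster-version 1.16.8-gke.15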
The issue is caused by the memory limit on the metadata-agent deployment being set too low: the pod requires more memory than the limit allows, so it gets OOM-killed.
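You can confirm the OOM kill yourself before applying the workaround; a quick check, assuming the default pod naming in kube-system (substitute the actual pod name in the second command):
# Find the metadata-agent pod, then inspect its last termination state
kubectl get pods -n kube-system | grep stackdriver-metadata-agent
kubectl describe pod -n kube-system <pod-name> | grep -A 3 "Last State"
# An OOM-killed container shows:
#   Last State:  Terminated
#     Reason:    OOMKilled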
There is a workaround for this issue until it is fixed.
You can override the base resources in the metadata-agent ConfigMap with:
kubectl edit cm -n kube-system metadata-agent-config
Setting baseMemory: 50Mi should be enough; if that doesn't work, use a higher value such as 100Mi or 200Mi.
The metadata-agent-config ConfigMap should then look something like this:
apiVersion: v1
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
    baseMemory: 50Mi
kind: ConfigMap
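If you prefer a non-interactive alternative to kubectl edit (e.g., for scripting), a sketch using kubectl patch; the escaped newlines reproduce the NannyConfiguration block shown above:
kubectl patch configmap metadata-agent-config -n kube-system --type merge \
  -p '{"data":{"NannyConfiguration":"apiVersion: nannyconfig/v1alpha1\nkind: NannyConfiguration\nbaseMemory: 50Mi\n"}}'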
Note also that you need to restart the deployment, as the ConfigMap doesn't get picked up automatically; deleting the deployment lets the GKE addon manager recreate it with the new settings:
kubectl delete deployment -n kube-system stackdriver-metadata-agent-cluster-level
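The addon manager should recreate the deployment within a minute or so. To verify that the new memory settings were picked up (the jsonpath index assumes the agent is the first container in the pod spec):
# Inspect the recreated deployment's resource requests and limits
kubectl get deployment -n kube-system stackdriver-metadata-agent-cluster-level \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'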
For more details, see the addon-resizer documentation.