What's the recommended way to replace a bad GKE node instance?
We now use multi-zone clusters, which means I needed a new way to get the instance group name. My current shell commands:
BAD_INSTANCE=[your node name from kubectl get nodes]
kubectl cordon $BAD_INSTANCE
kubectl drain $BAD_INSTANCE
gcloud compute instances describe --format='value[](metadata.items.created-by)' $BAD_INSTANCE
gcloud compute instance-groups managed delete-instances --instances=$BAD_INSTANCE --zone=[from describe output] [grp from describe output]
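For reference, a rough sketch of how I fill in those placeholders automatically; it assumes the created-by value has the usual .../zones/ZONE/instanceGroupManagers/GROUP shape, which may not hold everywhere:
CREATED_BY=$(gcloud compute instances describe --format='value[](metadata.items.created-by)' $BAD_INSTANCE)
# pull the zone out of the path by looking for the "zones" segment
ZONE=$(echo "$CREATED_BY" | awk -F/ '{for (i = 1; i < NF; i++) if ($i == "zones") print $(i+1)}')
# the group name is the last path segment
GRP=$(echo "$CREATED_BY" | awk -F/ '{print $NF}')
gcloud compute instance-groups managed delete-instances --instances=$BAD_INSTANCE --zone=$ZONE $GRP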
However, I can find no way to target a specific compute instance VM for removal when resizing down.
There isn't a way to specify which VM to remove using the GKE API, but you can use the managed instance groups API to delete individual instances from the group. Note that this shrinks your node count by the number of instances you delete, so if you want to replace the nodes you will then need to scale your cluster back up to compensate. You can find the instance group name by running:
$ gcloud container clusters describe CLUSTER | grep instanceGroupManagers
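Putting that together, a minimal sketch, assuming a single-zone node pool, with CLUSTER, ZONE, and the target size of 3 as illustrative placeholders (on a multi-zone cluster the describe output lists one group per zone, so pick the group in the bad node's zone; newer gcloud versions spell the resize flag --num-nodes rather than --size):
# grab the group name from the end of the instance group URL
GRP=$(gcloud container clusters describe CLUSTER --zone ZONE | grep instanceGroupManagers | awk -F/ '{print $NF}')
# remove the bad node; this also shrinks the group by one
gcloud compute instance-groups managed delete-instances $GRP --instances=$BAD_INSTANCE --zone=ZONE
# scale the cluster back up to its original size
gcloud container clusters resize CLUSTER --zone ZONE --size 3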
Is it safe to simply resize up and then delete the instance using gcloud compute, or is there some container-aware way to do this?
If you delete an instance, the managed instance group will replace it with a new one, so if you scale up by one and then delete the troublesome instance you will be left with an extra node. If you are not concerned about the temporary loss of capacity, you can simply delete the VM and let it get recreated.
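For the second option, a minimal sketch (ZONE is whichever zone the bad instance lives in):
# delete the VM outright; the managed instance group notices it is gone and recreates it
gcloud compute instances delete $BAD_INSTANCE --zone=ZONE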
Before removing an instance, you can run kubectl drain to move the workload off of it. This results in faster rescheduling of pods than if you simply delete the instance and wait for the controllers to notice that it is gone.
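A sketch of that ordering; --ignore-daemonsets is typically needed on GKE nodes, where agents such as the logging collector run as DaemonSets and cannot be evicted elsewhere:
# evict the pods so the scheduler places them on healthy nodes right away
kubectl drain $BAD_INSTANCE --ignore-daemonsets
# once the drain completes, remove the instance using either approach above,
# then check that the replacement node registers and becomes Ready
kubectl get nodes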