DefaultCredentialsError in GKE Workload identity
I extensively use Google Workload Identity to authenticate my workloads running on GKE clusters. This allows me to give different permissions to different pods as they run.
For example, I can use Workload Identity to allow only certain pods to access a given Google Cloud Storage bucket or to push Docker images to the Google Container Registry.
Workload Identity allows you to bind a Kubernetes service account to a Google service account. Doing that with the dedicated Terraform module works great and has a very low entry barrier.
While working on setting up Dask clusters, though, I started experiencing an intermittent issue that led me to do some additional investigation. The problem is not related to the application itself; it appears more often whenever you have dynamic workloads that need to scale up and down rapidly.
It is important to mention here that I configured these clusters with Workload Identity so that every worker could download data from Google Cloud Storage.
Sometimes, when the cluster started, some of the workers (sometimes none, sometimes a few) would fail to start with the following error:
DefaultCredentialsError: Could not automatically determine credentials.
Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials
and re-run the application. For more information, please see
https://cloud.google.com/docs/authentication/getting-started
This immediately prompted some questions… I have been using Workload Identity for a while now and I was reasonably sure that it was well configured. Also, the problem was very sporadic and difficult to reproduce… Sometimes you would get half of the workers failing, sometimes a few and sometimes none.
If most of the workers have no problem though… this means that Workload Identity is set up correctly… The problem must be somewhere else.
After some debugging and reading online I stumbled upon the following sentence in the Workload Identity documentation:
The GKE metadata server needs a few seconds before it can start accepting requests on a newly created Pod. Therefore, attempts to authenticate using Workload Identity within the first few seconds of a Pod’s life might fail. Retrying the call will resolve the problem.
Here we go… Indeed, after some additional testing, I realized that the problem was more likely to happen on newly spawned nodes than on old ones.
The fix? You can either implement retry logic in your code or wait for the metadata server to be ready before processing work.
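The retry option can be a small wrapper around the credential lookup. Here is a minimal sketch: the `fetch_credentials` stub and the retry parameters are my own illustration; in a real worker you would wrap `google.auth.default()` and catch `DefaultCredentialsError` instead.

```python
import time

def retry(fn, attempts=5, delay=2.0, exc=Exception):
    """Call fn, retrying on `exc` up to `attempts` times with a fixed delay."""
    for attempt in range(attempts):
        try:
            return fn()
        except exc:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(delay)

# Stand-in for google.auth.default() on a Pod whose metadata server
# is not ready yet: it fails twice, then succeeds.
calls = []
def fetch_credentials():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("Could not automatically determine credentials.")
    return "token"

print(retry(fetch_credentials, delay=0.1))  # succeeds on the third call
```

This works, but it means touching every piece of code that authenticates early in the Pod's life.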
I decided to add a small init container that checks if the metadata server is ready to accept requests before starting the Dask worker. In this way, we do not need to adjust any pre-existing code.
initContainers:
- name: test-gke-metadataserver
  image: curlimages/curl:7.75.0
  command:
  - sh
  - -c
  - >
    until test $(curl -s -o /dev/null -w '%{http_code}' -H 'Metadata-Flavor: Google'
    http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token) = '200';
    do echo 'Waiting for metadata server...'; sleep 2; done;
    echo 'Metadata server available'
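If you would rather keep the check in application code instead of the Pod spec, the same probe the init container performs can be sketched in Python. The function name, timeout, and interval below are my own choices, not part of any GKE API:

```python
import time
import urllib.request
import urllib.error

# Same endpoint the init container polls.
METADATA_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                "instance/service-accounts/default/token")

def wait_for_metadata_server(timeout=60.0, interval=2.0, url=METADATA_URL):
    """Poll the GKE metadata server until it answers 200, or give up.

    Returns True once the server responds, False after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        request = urllib.request.Request(
            url, headers={"Metadata-Flavor": "Google"})
        try:
            with urllib.request.urlopen(request, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not ready yet (DNS failure, connection refused, ...)
        time.sleep(interval)
    return False
```

Calling this once at worker start-up, before the first Google Cloud Storage access, achieves the same effect as the init container, at the cost of a code change.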
Note: Since I fixed this problem some months back, the GKE team has published their own solution, which also includes an init container. You can find more information on the Workload Identity Limitations page.