C3W3 Distributed Multi-worker TensorFlow Training on Kubernetes -- Workers never run

Hello all,

Could anyone help with my C3W3 graded external tool?

The status of my three workers has stayed at ImagePullBackOff for more than 15 minutes. When I run the command "kubectl logs --follow ${JOB_NAME}-worker-0", it reports that it failed to pull the image.

My YAML file looks like this. I don't see any wrong setting in it.

Could anyone suggest how to make the workers run?

Thanks.

Gustav

Occasionally the workers had problems pulling images.

Please fix the image value in the manifest file.
It should match the image name you pushed to the registry. Changing mnist to mnist-train should fix this error.

I have the same issue. Where would I find this image file?

See the "Packaging training code in a docker image" section, where you build and push mnist-train to GCR.

To clarify, the image name should be gcr.io/<YOUR_PROJECT_ID>/mnist-train
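
For anyone else hitting this, here is a rough sketch of where that value usually sits in a TFJob manifest. The lab's actual file is not shown in this thread, so the surrounding fields are assumptions based on the standard Kubeflow TFJob layout. Only the image line is the point here, and <YOUR_PROJECT_ID> stays a placeholder for your own project ID.

```yaml
# Fragment only, not a complete manifest. Field layout assumed from the
# standard Kubeflow TFJob spec; the lab's file may differ in other fields.
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 3                 # the three workers mentioned in the question
      template:
        spec:
          containers:
            - name: tensorflow
              # Must match the image you built and pushed to GCR:
              # mnist-train, not mnist.
              image: gcr.io/<YOUR_PROJECT_ID>/mnist-train
```

If the name in the manifest and the name in the registry differ, the pull will most likely keep failing and the workers will stay in ImagePullBackOff.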


After your suggested change, the workers are running now. Thanks.