C3W3 Distributed Multi-worker TensorFlow Training on Kubernetes - ImagePullBackOff or ErrImagePull error

The TFJob is not initializing properly. An ImagePullBackOff or ErrImagePull error appears when I list the pods. Has anyone else faced this problem?

Hi!

Which Course is this for?

Hi SamReiswig.
The course is Machine Learning Modeling Pipelines in Production, Course 3 of the MLOps Specialization.

This is the Forum for the Machine Learning Specialization. Let me see if I can move this to the MLOps Section.

You have to edit the tfjob.yaml file in two places:

  1. As mentioned in the guide, edit both paths to match your GCP instance.
  2. Also edit the image value, as shown in the image I have attached. You will have obtained this value in the step just before this one.
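To double-check the second edit before submitting the job, it may help to print every image line in the manifest. This is only a minimal sketch: `show_images` is a hypothetical helper name, and it assumes the image is set on lines containing `image:` in tfjob.yaml.

```shell
# Hypothetical helper: print every line containing "image:" in a manifest,
# prefixed with its line number, so a leftover placeholder is easy to spot.
show_images() {
  grep -n 'image:' "$1"
}

# Example usage (assumes tfjob.yaml is in the current directory):
#   show_images tfjob.yaml
```

If the printed value still looks like a placeholder rather than the image you built in the earlier step, that would be consistent with the ImagePullBackOff or ErrImagePull error.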

If you have already started the process, you may have to kill it using this command:

kubectl delete tfjob $JOB_NAME

and then follow the guide as given.
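Put together, the delete-and-retry sequence might look like the sketch below. `resubmit_tfjob` is a hypothetical helper name, and it assumes the job is resubmitted with `kubectl apply -f`; if the lab guide uses a different command for that step, use the guide's command instead.

```shell
# Hypothetical helper: remove the old TFJob, resubmit the edited manifest,
# then list the pods so their status can be checked.
resubmit_tfjob() {
  job_name="$1"
  manifest="$2"
  # Delete the previous job; --ignore-not-found avoids an error if it is already gone.
  kubectl delete tfjob "$job_name" --ignore-not-found
  # Assumption: the edited manifest is submitted with kubectl apply.
  kubectl apply -f "$manifest"
  # The pods should no longer report ImagePullBackOff or ErrImagePull.
  kubectl get pods
}

# Example usage:
#   resubmit_tfjob "$JOB_NAME" tfjob.yaml
```

If a pod still fails to pull its image, `kubectl describe pod` followed by the pod name shows the events for that pod, including the image reference it tried to pull.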


This and the post by @minggatsby helped me complete this lab after multiple failures. Thank you!


This is the critical instruction, as updating the YAML file does not kill the previous jobs.