Distributed Multi-worker TensorFlow Training on Kubernetes - Job not running

I'm getting the following output after running kubectl get pods:
student_03_58a0e401dcfa@cloudshell:~/lab-files (qwiklabs-gcp-00-eba962a0a3b0)$ kubectl get pods
multi-worker-worker-0 0/1 ImagePullBackOff 0 37m
multi-worker-worker-1 0/1 ImagePullBackOff 0 37m
multi-worker-worker-2 0/1 ImagePullBackOff 0 37m

So I couldn't get the job running. Here is my tfjob.yaml:
apiVersion: kubeflow.org/v1
kind: TFJob
metadata: # kpt-merge: /multi-worker
  name: multi-worker

cleanPodPolicy: None

  replicas: 3

        - name: tensorflow
          image: gcr.io/qwiklabs-gcp-00-eba962a0a3b0/mnist-train

            - --epochs=5
            - --steps_per_epoch=100
            - --per_worker_batch=64
            - --saved_model_path=gs://qwiklabs-gcp-00-eba962a0a3b0-bucket/saved_model_dir
            - --checkpoint_path=gs://qwiklabs-gcp-00-eba962a0a3b0-bucket/checkpoints

Hi! Thank you for reporting. Will check this now and update you asap.

Hi! Unfortunately I can’t replicate the issue. The pods are created without issues.

When you do the lab, can you run cat tfjob.yaml in Cloud Shell and confirm that you see the output you expect (i.e. with the image and args revised)? Also make sure that you've pushed the image to Container Registry. That is done earlier in the lab with these commands:

docker build -t gcr.io/${PROJECT_ID}/${IMAGE_NAME} .
docker push gcr.io/${PROJECT_ID}/${IMAGE_NAME}

If successful, you should see the image when you paste the URL into your browser in a separate window (e.g. gcr.io/qwiklabs-gcp-00-eba962a0a3b0/mnist-train). If you get a permission error, make sure you're logged in as the Qwiklabs student account: click the icon in the upper right and make sure it points to the <id>@qwiklabs.net email that's provided at the start of the lab. If the image is not found, please try pushing it again.
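
If you'd rather check from Cloud Shell than from the browser, something like the sketch below may help narrow down the ImagePullBackOff. It assumes PROJECT_ID and IMAGE_NAME are still set as in the lab and that the pod is named multi-worker-worker-0 as in the output above. The function names image_ref and diagnose_pull are made up for illustration and are not part of the lab.

```shell
#!/usr/bin/env bash

# Hypothetical helper: builds the image reference that tfjob.yaml should point to.
image_ref() {
  printf 'gcr.io/%s/%s\n' "$1" "$2"
}

# Hypothetical helper: runs three checks for an ImagePullBackOff.
diagnose_pull() {
  local image
  image="$(image_ref "${PROJECT_ID:?PROJECT_ID is not set}" "${IMAGE_NAME:?IMAGE_NAME is not set}")"

  # 1. Show the image line in the manifest so it can be compared with $image.
  echo "Expected image: ${image}"
  grep -n 'image:' tfjob.yaml

  # 2. List the pushed tags; this fails if the image was never pushed.
  gcloud container images list-tags "${image}"

  # 3. The Events section at the end of the output shows the exact pull error.
  kubectl describe pod multi-worker-worker-0
}
```

After sourcing the snippet, run diagnose_pull from the directory that contains tfjob.yaml. If step 2 fails, the image most likely still needs to be pushed. If step 2 succeeds but the pods still can't pull, the Events from step 3 should say whether it is a name mismatch or a permission problem.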

Hope this helps!

