Distributed Multi-worker TensorFlow Training on Kubernetes - Job not running

I'm getting the following output from kubectl get pods:
student_03_58a0e401dcfa@cloudshell:~/lab-files (qwiklabs-gcp-00-eba962a0a3b0)$ kubectl get pods
NAME                    READY   STATUS             RESTARTS   AGE
multi-worker-worker-0   0/1     ImagePullBackOff   0          37m
multi-worker-worker-1   0/1     ImagePullBackOff   0          37m
multi-worker-worker-2   0/1     ImagePullBackOff   0          37m

So I couldn’t get the job running. Here is my tfjob.yaml:
apiVersion: kubeflow.org/v1
kind: TFJob
metadata: # kpt-merge: /multi-worker
  name: multi-worker
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/qwiklabs-gcp-00-eba962a0a3b0/mnist-train
              args:
                - --epochs=5
                - --steps_per_epoch=100
                - --per_worker_batch=64
                - --saved_model_path=gs://qwiklabs-gcp-00-eba962a0a3b0-bucket/saved_model_dir
                - --checkpoint_path=gs://qwiklabs-gcp-00-eba962a0a3b0-bucket/checkpoints

The issue is still there. Please check.

Hi! Thank you for reporting. Will check this now and update you asap.

Hi! Unfortunately I can’t replicate the issue. The pods are created without issues.

When you do the lab, can you run cat tfjob.yaml in Cloud Shell and confirm that the output is what you expect (i.e. with the image and args revised)? Also make sure that you’ve pushed the image to Container Registry. That is done earlier in the lab with these commands:

IMAGE_NAME=mnist-train
docker build -t gcr.io/${PROJECT_ID}/${IMAGE_NAME} .
docker push gcr.io/${PROJECT_ID}/${IMAGE_NAME}

If successful, you should see the image when you paste the URL into your browser in a separate window (e.g. gcr.io/qwiklabs-gcp-00-eba962a0a3b0/mnist-train). If you get a permission error, make sure you’re logged in as the Qwiklabs student account: click on the icon in the upper right and make sure it points to the <id>@qwiklabs.net email provided at the start of the lab. If the image is not found, please try pushing it again.
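If you prefer to check from Cloud Shell instead of the browser, here is a rough sketch of the same checks. The function names (manifest_images, image_matches, pull_events, pushed_tags) are made up for illustration and are not part of the lab. It assumes kubectl and gcloud are already pointed at your lab project and cluster.

```shell
# Print the image reference(s) declared in a manifest such as tfjob.yaml.
# Matches lines like "image: gcr.io/project/name", with an optional leading "- ".
manifest_images() {
  sed -n 's/^[[:space:]]*-\{0,1\}[[:space:]]*image:[[:space:]]*//p' "$1"
}

# Succeed only if the manifest ($1) references exactly the expected image ($2).
image_matches() {
  [ "$(manifest_images "$1")" = "$2" ]
}

# Show the Events section for a pod, which normally includes the reason
# the image pull failed (for example, image not found or access denied).
pull_events() {
  kubectl describe pod "$1" | sed -n '/^Events:/,$p'
}

# List the tags pushed for an image; an error or empty list suggests
# the docker push step did not complete.
pushed_tags() {
  gcloud container images list-tags "$1"
}
```

For example, after setting IMAGE="gcr.io/${PROJECT_ID}/mnist-train", you could run image_matches tfjob.yaml "$IMAGE" || echo "tfjob.yaml does not reference $IMAGE", then pull_events multi-worker-worker-0 and pushed_tags "$IMAGE" to see whether the problem is in the manifest, in permissions, or in the push.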

Hope this helps!

Did you find a solution? I’m facing the same issue. Thanks

I’m facing the same issue.