Distributed Multi-worker TensorFlow Training on Kubernetes: Pod status is "ImagePullBackOff"

Hi Everyone,
The status of my pods is ImagePullBackOff.

The image pull fails with the error below.
Failed to pull image "mnist": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/mnist:latest": failed to resolve reference "docker.io/library/mnist:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed

Thanks.

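Hi Kiran! Welcome to Discourse! A common cause of this error is that the image and args fields were not updated correctly in the YAML file. The error shows the cluster trying to pull "mnist" from docker.io, which suggests the image field still has its original value instead of your gcr.io path. Please check that part again. You can use the Cloud Shell Editor to edit the file if needed. Hope this helps!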
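
To illustrate why a bare image name such as mnist ends up as docker.io/library/mnist:latest in the error, here is a simplified sketch of the usual default-registry rules. The helper name normalize_image is hypothetical (it is not a kubectl or Docker command), and the real rules have more edge cases than this covers.

```shell
# Simplified sketch of how a short image name is expanded before the pull.
# normalize_image is a hypothetical helper, not a kubectl or Docker command.
normalize_image() {
  ref="$1"
  case "$ref" in
    */*)
      first="${ref%%/*}"
      case "$first" in
        # A first component with a dot or colon, or "localhost", is a registry host.
        *.*|*:*|localhost) domain="$first"; path="${ref#*/}" ;;
        *) domain="docker.io"; path="$ref" ;;
      esac
      ;;
    *)
      domain="docker.io"; path="$ref" ;;
  esac
  # Official Docker Hub images live under the "library/" namespace.
  if [ "$domain" = "docker.io" ]; then
    case "$path" in
      */*) ;;
      *) path="library/$path" ;;
    esac
  fi
  # Without a tag or digest, the tag defaults to "latest".
  case "${path##*/}" in
    *:*|*@*) ;;
    *) path="$path:latest" ;;
  esac
  printf '%s/%s\n' "$domain" "$path"
}

normalize_image mnist
# prints docker.io/library/mnist:latest
normalize_image gcr.io/qwiklabs-gcp-04-7945376c8032/mnist-train
# prints gcr.io/qwiklabs-gcp-04-7945376c8032/mnist-train:latest
```

So if the error mentions docker.io/library/..., the image value the cluster received most likely had no registry prefix at all.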

The issue is resolved. Thank you, Chris.

@chris.favila I am facing a similar issue even though I have updated the YAML file:

containers:
  - name: tensorflow
    image: gcr.io/qwiklabs-gcp-01-8c7f6bd2d495/mnist-train
    args:
      - --epochs=5
      - --steps_per_epoch=100
      - --per_worker_batch=64
      - --saved_model_path=gs://qwiklabs-gcp-01-8c7f6bd2d495-bucket/saved_model_dir
      - --checkpoint_path=gs://qwiklabs-gcp-01-8c7f6bd2d495-bucket/checkpoints

Same here. Is your problem solved?

I am also facing the same problem. Here is my tfjob.yaml file:


apiVersion: kubeflow.org/v1
kind: TFJob
metadata: # kpt-merge: /multi-worker
  name: multi-worker
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/qwiklabs-gcp-04-7945376c8032/mnist-train
              args:
                - --epochs=5
                - --steps_per_epoch=100
                - --per_worker_batch=64
                - --saved_model_path=gs://qwiklabs-gcp-04-7945376c8032-bucket/saved_model_dir
                - --checkpoint_path=gs://qwiklabs-gcp-04-7945376c8032-bucket/checkpoints

I verified that the image has been pushed:

student_01_1ce2cbb5f651@cloudshell:~/lab-files (qwiklabs-gcp-04-7945376c8032)$ gcloud container images list

NAME
gcr.io/qwiklabs-gcp-04-7945376c8032/mnist-train
Only listing images in gcr.io/qwiklabs-gcp-04-7945376c8032. Use --repository to list images in other repositories.

I'm not sure what I am missing here.
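
One thing that may be worth checking in this situation is whether the TFJob object in the cluster carries the same image as the local tfjob.yaml. The sketch below reads the image from the live object. The function name live_tfjob_image is made up for illustration, and the jsonpath assumes the job has the same layout as the YAML above (a Worker replica spec with one container).

```shell
# Print the image recorded in the live TFJob object (not the local YAML file).
# live_tfjob_image is a hypothetical helper; $1 is the job name, e.g. multi-worker.
live_tfjob_image() {
  kubectl get tfjob "$1" \
    -o jsonpath='{.spec.tfReplicaSpecs.Worker.template.spec.containers[0].image}'
}
```

Running live_tfjob_image multi-worker should print the gcr.io path from the file. If it prints something else, the cluster is probably still holding an earlier version of the job.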

@chris.favila could you please suggest next steps for this?

I was able to fix this by deleting the job and then recreating it. For some reason, once a bad job has been created, it doesn't pick up the corrected YAML.

kubectl delete tfjob $JOB_NAME
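
The full delete-and-recreate sequence could look roughly like the sketch below. It assumes the corrected manifest is tfjob.yaml and the job is named multi-worker, as in the YAML posted earlier in the thread, and that the job is created with kubectl apply -f (the lab may use a different command). The wrapper function recreate_tfjob is a made-up name for illustration.

```shell
# Delete the stale TFJob, recreate it from the corrected manifest, then list pods.
# recreate_tfjob is a hypothetical wrapper; adjust the arguments to your lab.
recreate_tfjob() {
  job_name="$1"   # e.g. multi-worker
  manifest="$2"   # e.g. tfjob.yaml
  # --ignore-not-found avoids an error if the job was already deleted.
  kubectl delete tfjob "$job_name" --ignore-not-found || return 1
  kubectl apply -f "$manifest" || return 1
  kubectl get pods
}
```

For example: recreate_tfjob multi-worker tfjob.yaml. The pods may take a little while to leave the ContainerCreating state after the job is recreated.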
