C3W3 Distributed Multi-worker TF Training on Kubernetes - edit TFJob

Hello,

I’ve tried many times to edit the TFJob YAML file to change the --saved_model_path and --checkpoint_path arguments, but I am not able to do it.

What are the commands to edit the TFJob, or where can I find the file? The task requires editing those parameters so that the job runs properly… I am stuck at 70/100 because I can’t complete the last part of the lab and need further instructions. When I create the multi-worker job, it says “Ready 0/1” for each of the three workers.

Also, I get the error “Error from server (BadRequest): container “tensorflow” in pod “multi-worker-worker-0” is waiting to start: trying and failing to pull image”

Thank you

Hi Irene! Have you tried using the built-in Cloud Shell editor? There should be a button in the Cloud Shell terminal to switch to that interface. From there, you can see a file explorer where you can navigate to the YAML file that you want to edit. After that, save the file and run the kubectl commands again to hopefully get the correct image and model. Hope this helps!

It worked. Thank you

Hi Chris! I already modified and saved the tfjob.yaml file, but the error still exists.

Thank you in advance!

Hi! Welcome to Discourse! Did you see your image when you ran this command?

gcloud container images list

Is it identical to the one you put in the YAML file? The error message seems to indicate that it cannot find the image. Kindly check. Thanks!
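For anyone who wants to check this without comparing by eye, here is a minimal sketch. The function names `manifest_image` and `image_is_listed` are hypothetical helpers, not part of the lab files, and the usage comments assume that tfjob.yaml sits in the current directory and that `--format='value(name)'` prints one image name per line.

```shell
# Hypothetical helpers (not part of the lab files).

# Print the value of the first `image:` field of a manifest read from stdin.
manifest_image() {
  sed -n 's/^[[:space:]]*image:[[:space:]]*//p' | head -n 1
}

# Succeed only if $1 matches, character for character, one line of $2.
image_is_listed() {
  printf '%s\n' "$2" | grep -Fxq -- "$1"
}

# Usage in Cloud Shell (assumes tfjob.yaml is in the current directory):
#   listed=$(gcloud container images list --format='value(name)')
#   image=$(manifest_image < tfjob.yaml)
#   if image_is_listed "$image" "$listed"; then
#     echo "OK: $image is in the registry"
#   else
#     echo "MISMATCH: $image is not in the registry list"
#   fi
```

The comparison is an exact string match, so a transposed letter in the registry host or a missing project ID shows up as a mismatch. If the manifest adds a tag such as `:latest`, the names will differ even though the image exists, so strip the tag before comparing.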

Hi Quan, I’m stuck there too. I followed all the steps, but that didn’t resolve the issue…
Did you find a way to get it done?

Thanks,

Hi Dahmani! I’m not sure if you also resolved the issue at the start of this lab. Please see the tip here re: the cluster version. I was able to complete the assignment with that. Also make sure that your image exists by running gcloud container images list. Hope this helps!

Hi, not yet. I followed all the steps, and the image listed by gcloud container images list is indeed identical to the one I put in the YAML file.

Hi Chris,

First, thank you for being so responsive.

I followed the steps of the lab and am still stuck on the final task of my specialization.

Please find screenshots of the terminal below:

[Screenshots 1, 2 and 3]

Could you please provide more details on how to fix the issue?

Thanks,

RD

Hi RD! From your second screenshot, it seems you forgot to edit the image field of the tfjob.yaml. It is shown as mnist whereas it should be gcr.io/<project_id>/mnist-train. You should see this when you apply the yaml file:

Then when you describe, that part of your second screenshot should look something like this (notice the Image field):

I just retried the lab and got it graded correctly. In case it fails again when you retry, please post screenshots of the tfjob.yaml, the results when applying, describing, and the output of gcloud container images list. You can also do a sanity check by opening a new tab and putting in the address of the image (i.e. gcr.io/<project_id>/mnist-train). It should look something like this and this will tell you that the image does indeed exist:

Hope these help!
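If it is easier than taking several screenshots, the outputs requested above can be gathered into one text report. This is only a sketch: `collect_tfjob_report` is a hypothetical helper name, and it assumes kubectl and gcloud are already pointed at the lab cluster and project.

```shell
# Hypothetical helper: print the manifest and the outputs requested above.
# Arguments: $1 = path to the manifest, $2 = name of the TFJob.
collect_tfjob_report() {
  manifest="$1"
  job_name="$2"
  echo "== manifest: $manifest =="
  cat "$manifest"
  echo "== kubectl describe tfjob $job_name =="
  kubectl describe tfjob "$job_name"
  echo "== kubectl get pods =="
  kubectl get pods
  echo "== gcloud container images list =="
  gcloud container images list
}

# Usage in Cloud Shell (file name and job name are assumptions):
#   collect_tfjob_report tfjob.yaml multi-worker > report.txt
```

The Image line in the describe output and the name printed by gcloud container images list should then appear in the same file, which makes a mismatch easier to spot.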

Hi Chris,

Thank you for being so responsive.

I tried twice to resolve the problem. I corrected the error in the image field: “image : gcr…”

Same problem: it didn’t change anything. My account didn’t switch, and I saw the same state as your “Container registry” screenshot.

Honestly, I have spent too much time on this lab. I wanted to give it one last try, but now I get an error when creating the cluster: the GKE version is outdated.

Hi Dahmani. I gave the cluster version in the FAQ in one of my earlier replies. What version did you use? Also, please provide the same screenshots as you did earlier. Maybe there’s a new bug. Please don’t give up. I just redid the lab a few days ago and you can finish it in 30 minutes by using that cluster version and editing and applying the tfjob.yaml correctly. Every checkpoint should turn green before you move on to the next one. If you get an error, please take a screenshot of the message. You can do this!

Hi Chris!

Thank you for the encouragement. Actually, it’s the only lab where I got stuck.
As requested, here is a screenshot:

There was an update in the last few days, which caused the error when creating the cluster.

Sincerely,

Woah now it’s also not working! Please try this cluster version instead: 1.20.15-gke.300

The command is now:

gcloud container clusters create $CLUSTER_NAME \
  --project=$PROJECT_ID \
  --release-channel=stable \
  --cluster-version=1.20.15-gke.300 \
  --machine-type=n1-standard-4 \
  --scopes compute-rw,gke-default,storage-rw \
  --num-nodes=3

I just did the lab and got the grade again. In case you still get an error creating the cluster, please look at the latest stable version here and use that for the --cluster-version flag above.

Hope you get to complete it!
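Since the supported versions keep changing, one way to avoid hard-coding a version is to ask GKE for the current default of the stable channel. The following is a sketch under assumptions: `stable_default_version` is a hypothetical helper name, and gcloud needs a default compute zone or region configured (or an explicit --zone/--region flag added) for get-server-config to work.

```shell
# Hypothetical helper: print the default GKE version currently offered
# on the stable release channel for the configured zone or region.
stable_default_version() {
  gcloud container get-server-config \
    --flatten="channels" \
    --filter="channels.channel=STABLE" \
    --format="value(channels.defaultVersion)"
}

# Usage in Cloud Shell (same flags as the command above):
#   CLUSTER_VERSION=$(stable_default_version)
#   gcloud container clusters create "$CLUSTER_NAME" \
#     --project="$PROJECT_ID" \
#     --release-channel=stable \
#     --cluster-version="$CLUSTER_VERSION" \
#     --machine-type=n1-standard-4 \
#     --scopes compute-rw,gke-default,storage-rw \
#     --num-nodes=3
```

Whether the lab grades correctly on a newer version is not guaranteed by this sketch; it only avoids the "unsupported version" error at cluster creation.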

Hi Chris,

I finally completed it and finished the MLOps specialization.
Thank you so much for your support!

Best regards,

Awesome! Glad to hear that! Congratulations on completing the specialization!

Hi All, I have the same exact problem, my images are not being pulled.

I have created the cluster with the right pods, and have set up the right images:

student_01_07ff6d65b383@cloudshell:~/lab-files (qwiklabs-gcp-00-0cf496d13fde)$ gcloud container images list
NAME: gcr.io/qwiklabs-gcp-00-0cf496d13fde/mnist-train
Only listing images in gcr.io/qwiklabs-gcp-00-0cf496d13fde. Use --repository to list images in other repositories.

I have also checked that I have the right images in Docker:

student_01_07ff6d65b383@cloudshell:~/lab-files (qwiklabs-gcp-00-0cf496d13fde)$ docker images --all
REPOSITORY                                        TAG       IMAGE ID       CREATED          SIZE
gcr.io/qwiklabs-gcp-00-0cf496d13fde/mnist-train   latest    2f3c7eab383c   20 minutes ago   1.59GB
<none>                                            <none>    269791f7312a   20 minutes ago   1.59GB
<none>                                            <none>    37ca5382d058   20 minutes ago   1.59GB
tensorflow/tensorflow                             2.4.1     45872ba1e662   14 months ago    1.57GB
student_01_07ff6d65b383@cloudshell:~/lab-files (qwiklabs-gcp-00-0cf496d13fde)$

I have updated the right path and image into the manifest file:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata: # kpt-merge: /multi-worker
  name: multi-worker
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: tensorflow
              image: grc.io/qwiklabs-gcp-00-0cf496d13fde/mnist-train
              args:
                - --epochs=5
                - --steps_per_epoch=100
                - --per_worker_batch=64
                - --saved_model_path=gs://qwiklabs-gcp-00-0cf496d13fde-bucket/saved_model_dir
                - --checkpoint_path=gs://qwiklabs-gcp-00-0cf496d13fde-bucket/checkpoints

And I also see that the created job contains the right location of the image. The following is the last section of the command kubectl describe tfjob $JOB_NAME

<other lines above ...>
          Containers:
            Args:
              --epochs=5
              --steps_per_epoch=100
              --per_worker_batch=64
              --saved_model_path=gs://qwiklabs-gcp-00-0cf496d13fde-bucket/saved_model_dir
              --checkpoint_path=gs://qwiklabs-gcp-00-0cf496d13fde-bucket/checkpoints
            Image:  grc.io/qwiklabs-gcp-00-0cf496d13fde/mnist-train
            Name:   tensorflow
Status:
  Conditions:
    Last Transition Time:  2022-04-03T02:17:20Z
    Last Update Time:      2022-04-03T02:17:20Z
    Message:               TFJob multi-worker is created.
    Reason:                TFJobCreated
    Status:                True
    Type:                  Created

The pods are still unable to pull the image:

student_01_07ff6d65b383@cloudshell:~/lab-files (qwiklabs-gcp-00-0cf496d13fde)$ kubectl get pods
NAME                    READY   STATUS             RESTARTS   AGE
multi-worker-worker-0   0/1     ImagePullBackOff   0          18m
multi-worker-worker-1   0/1     ImagePullBackOff   0          18m
multi-worker-worker-2   0/1     ImagePullBackOff   0          18m

May I know what I am doing wrong?

I have realized that the gcloud container clusters create command needs a different cluster version. Now I’m all good!

@chris.favila I tried your code above and got the message below. I am on my 4th attempt to finish course 3, and even finished course 4 just to feel like I am making progress. My company is paying for this course and they expected me to finish this last Friday. I have read ALL the threads related to this ImagePullBackOff error and nothing works. Why isn’t the course material for the lab updated!?

This is becoming not only very time-consuming but also unconstructive and painful.


gcloud container clusters create $CLUSTER_NAME \
  --project=$PROJECT_ID \
  --release-channel=stable \
  --cluster-version=1.20.15-gke.300 \
  --machine-type=n1-standard-4 \
  --scopes compute-rw,gke-default,storage-rw \
  --num-nodes=3
Default change: VPC-native is the default mode during cluster creation for versions greater than 1.21.0-gke.1500. To create advanced routes based clusters, please pass the `--no-enable-ip-alias` flag
Default change: During creation of nodepools or autoscaling configuration changes for cluster versions greater than 1.24.1-gke.800 a default location policy is applied. For Spot and PVM it defaults to ANY, and for all other VM kinds a BALANCED policy is used. To change the default values use the `--location-policy` flag.
Note: Your Pod address range (`--cluster-ipv4-cidr`) can accommodate at most 1008 node(s).
ERROR: (gcloud.container.clusters.create) ResponseError: code=400, message=Master version "1.20.15-gke.300" is unsupported.
student_04_234c4256390d@cloudshell:~ (qwiklabs-gcp-01-3af109cfb8ea)$ gcloud container clusters get-credentials $CLUSTER_NAME
Fetching cluster endpoint and auth data.
ERROR: (gcloud.container.clusters.get-credentials) ResponseError: code=404, message=Not found: projects/qwiklabs-gcp-01-3af109cfb8ea/zones/us-central1-f/clusters/cluster-1.
No cluster named ‘cluster-1’ in qwiklabs-gcp-01-3af109cfb8ea.