Course 3 Week 3

How do I update --saved_model_path and --checkpoint_path?
I don’t understand how to do it. It would be a great help if someone could help me out here.

Thank you.

I’ve moved your post to MLOPS course 3.

In the tfjob.yaml file, replace qwiklabs-gcp-01-93af833e6576 with your project ID. These are the places where you’ll need to replace it:

  1. image
  2. --saved_model_path
  3. --checkpoint_path
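
If you would rather not edit each line by hand, the replacement above could be sketched with sed. This is only an illustration under assumptions: the YAML fragment below is a hypothetical stand-in for the lab's tfjob.yaml (the image name and bucket paths are made up), and my-project-id is a placeholder for your own project ID.

```shell
# Hypothetical stand-in for the lab's tfjob.yaml. The real file has more fields,
# and the image name and bucket layout shown here are assumptions.
workdir="$(mktemp -d)"
cat > "$workdir/tfjob.yaml" <<'EOF'
image: gcr.io/qwiklabs-gcp-01-93af833e6576/mnist-train
args:
  - --saved_model_path=gs://qwiklabs-gcp-01-93af833e6576-bucket/saved_model_dir
  - --checkpoint_path=gs://qwiklabs-gcp-01-93af833e6576-bucket/checkpoints
EOF

OLD_ID="qwiklabs-gcp-01-93af833e6576"  # placeholder ID quoted in the instructions
NEW_ID="my-project-id"                  # assumption: put your own project ID here

# Replace every occurrence of the placeholder ID; the original is kept as tfjob.yaml.bak.
sed -i.bak "s/${OLD_ID}/${NEW_ID}/g" "$workdir/tfjob.yaml"

# List the lines that now reference the new ID so you can check all three places.
grep -n "${NEW_ID}" "$workdir/tfjob.yaml"
```

In the lab itself you would skip the stand-in file and run only the sed and grep lines against the real tfjob.yaml in lab-files.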

Actually, I understand that part. I just don’t understand how to make the update. I tried searching Google for a solution but found nothing. Can you help me out here?

Sure. Remember that you are inside the lab-files directory, which contains a tfjob.yaml file.
Open this file in the GCP shell and edit it. You can use an editor like vim or vi to make the changes.

Hi Ehtesham! In addition to what Balaji said, you can also use the built-in Cloud Shell Editor to edit the YAML file. There should be an Open Editor button at the top right of the Cloud Shell terminal. It has a more intuitive UI, and you can navigate to tfjob.yaml in the left panel to edit the lines mentioned in the instructions. Make sure to save your changes, then go back to the Cloud Shell by clicking Open Terminal. From there, you can execute the next instructions. Hope this helps!

Hi,
Yes, I followed those steps, but I’m still getting this error when I try to retrieve the logs for the chief (worker 0). Can you help me out here?

This is my YAML file

Hi Ehtesham! Sorry for the late reply. Discourse did not send a notification so I didn’t see this sooner. re: your latest output, that is strange. It seems it can’t see the image. Can you show here the output of:

gcloud container images list

before you edit tfjob.yaml? Then please also show the output of this command:

JOB_NAME=multi-worker
kubectl describe tfjob $JOB_NAME

after you apply tfjob.yaml. This might show if there are mismatching values. Thanks!
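
As a rough sketch of the comparison this is meant to enable, the helper below checks whether the image named in a YAML file appears in a registry listing. check_image is a hypothetical name introduced here for illustration, and the parsing assumes a single image: line and a registry host without a port.

```shell
# check_image FILE LISTING
# Hypothetical helper: succeeds if the image named in FILE (tag stripped)
# appears as a whole line in LISTING, and fails otherwise.
check_image() {
  # Take the value of the first "image:" line, ignoring indentation and a list dash.
  image="$(sed -n 's/^[[:space:]-]*image:[[:space:]]*//p' "$1" | head -n 1)"
  # Drop a trailing ":tag" (assumes the registry host has no port number).
  repo="${image%%:*}"
  # Look for an exact, whole-line match in the listing.
  printf '%s\n' "$2" | grep -Fxq "$repo"
}

# Example use in Cloud Shell (not run here, since it needs gcloud and the lab project):
#   check_image tfjob.yaml "$(gcloud container images list --format='value(name)')" \
#     && echo "image found" || echo "image NOT found"
```

If the helper reports that the image is not found, the project ID in tfjob.yaml likely does not match the project the image was pushed to.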

Odd. I didn’t get a notification earlier either. I just got notified via Chris’ reply now.

Hi Chris and Balaji,
Thank you for your help, I was able to complete it afterwards.

I’m encountering exactly the same error. Can I get to know why it is happening? My pods are at 0 and the image cannot be found. If anyone can help, please let me know!

Anyone who faces the same issue should go through this query.
At the end, @chris.favila gives the solution. Please follow the step where he asks you to restart the cluster with the stable version.

For the stable version part, I just went to the site mentioned in his comment, copied the stable version, and restarted the lab. It’s working now.