Course 3 Week 3

Ehtesham_Nehal · April 28, 2022, 6:10pm

How do I update --saved_model_path and --checkpoint_path?
I can't work out how to do it. It would be a great help if someone could help me out here.

Thank you.

balaji.ambresh · April 28, 2022, 7:12pm

I’ve moved your post to MLOPS course 3.

balaji.ambresh · April 28, 2022, 7:17pm

In the tfjob.yaml file, replace qwiklabs-gcp-01-93af833e6576 with your project ID. These are the places where you'll end up replacing it:

image
--saved_model_path
--checkpoint_path
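
If you would rather not edit by hand, the replacement can be sketched as a small shell function. This is only a sketch: `replace_project_id` is a hypothetical helper name, and it assumes the placeholder ID quoted above appears literally in the image, --saved_model_path and --checkpoint_path lines of your manifest.

```shell
# Hypothetical helper: swap the lab's placeholder project ID for your own
# everywhere it appears in a manifest file.
replace_project_id() {
  manifest="$1"   # path to the manifest, e.g. tfjob.yaml
  new_id="$2"     # your own project ID
  old_id="qwiklabs-gcp-01-93af833e6576"   # placeholder ID quoted in this thread
  # Write to a temporary file first so the edit works with any sed variant.
  sed "s/${old_id}/${new_id}/g" "$manifest" > "${manifest}.tmp" &&
    mv "${manifest}.tmp" "$manifest"
}

# Usage, run from the directory that holds tfjob.yaml
# (my-project-id is a made-up example; use your own project ID):
# replace_project_id tfjob.yaml my-project-id
# grep -n "my-project-id" tfjob.yaml
```

The final grep simply lists the lines that now contain your ID, so you can confirm that all three places were updated.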

Ehtesham_Nehal · April 28, 2022, 8:52pm

Actually, I understand that part. I just can't work out how to update it. I tried searching Google for a solution but found nothing. Can you help me out here?

balaji.ambresh · April 29, 2022, 4:32am

Sure. Remember that you are inside the lab-files directory, which contains a tfjob.yaml file.
Open this file in the GCP shell and edit it. You can use an editor like vim or vi to make the changes.

chris.favila · April 29, 2022, 1:24pm

Hi Ehtesham! In addition to what Balaji said, you can also use the built-in Cloud Shell Editor to edit the YAML file. There should be an Open Editor button at the top right of the Cloud Shell terminal. It has a more intuitive UI, and you can navigate to tfjob.yaml in the left panel to edit the lines mentioned in the instructions. Make sure to save your changes, then go back to the Cloud Shell by clicking the Open Terminal button. From there, you can execute the next instructions. Hope this helps!

Ehtesham_Nehal · April 30, 2022, 3:28pm

Hi,
Yes, I followed those steps, but I'm still getting this error when I try to retrieve the logs for the chief (worker 0). Can you help me out here?

Ehtesham_Nehal · April 30, 2022, 3:29pm

This is my YAML file

chris.favila · May 6, 2022, 12:36pm

Hi Ehtesham! Sorry for the late reply. Discourse did not send a notification so I didn’t see this sooner. re: your latest output, that is strange. It seems it can’t see the image. Can you show here the output of:

gcloud container images list

before you edit tfjob.yaml? Then please also show the output of this command:

JOB_NAME=multi-worker
kubectl describe tfjob $JOB_NAME

after you apply tfjob.yaml. This might show if there are mismatching values. Thanks!
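
One way to do that comparison is sketched below. `image_is_listed` is a hypothetical helper, and it assumes the manifest has an `image:` line and that the saved listing has one image name per line (a header row such as NAME is harmless).

```shell
# Hypothetical helper: check whether the image named in a manifest appears
# in a saved listing of registry images.
image_is_listed() {
  manifest="$1"   # e.g. tfjob.yaml
  listing="$2"    # file holding the output of `gcloud container images list`
  # Take the value after the first `image:` key, then drop any :tag suffix.
  image=$(awk '{ for (i = 1; i < NF; i++) if ($i == "image:") { print $(i + 1); exit } }' "$manifest")
  image="${image%%:*}"
  # Compare against the first column of every line in the listing.
  if [ -n "$image" ] &&
     awk -v img="$image" '$1 == img { found = 1 } END { exit !found }' "$listing"; then
    echo "found: $image"
  else
    echo "missing: $image"
    return 1
  fi
}

# Usage (assumes gcloud is already set up for your project):
# gcloud container images list > images.txt
# image_is_listed tfjob.yaml images.txt
```

A "missing" result would suggest that the image field in tfjob.yaml does not match anything in the registry, for example because the project ID was not replaced.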

balaji.ambresh · May 6, 2022, 12:55pm

Odd. I didn't get a notification earlier either. I just got notified via Chris' reply now.

Ehtesham_Nehal · May 7, 2022, 7:06am

Hi Chris and Balaji,
Thank you for your help, I was able to complete it afterwards.

Muhammad_Ahmed_Nizam · June 9, 2022, 10:10pm

I'm encountering exactly the same error. Can anyone tell me why it is happening? My pods are at 0 and the image cannot be found. If anyone can help, please let me know!

Muhammad_Ahmed_Nizam · June 9, 2022, 10:49pm

Anyone who faces the same issue should go through this query.
At the end, @chris.favila gives the solution. Please follow the step where he asks you to restart the cluster with the stable version.

For the stable version part, I just went to the site mentioned in his comment, copied the stable version, and restarted the lab. It's working now.
