Update the image and args

quartermaine · July 4, 2021, 3:47pm

Next, update the --saved_model_path and --checkpoint_path arguments by replacing the bucket token with the name of your Cloud Storage bucket. Recall that your bucket name is [YOUR_PROJECT_ID]-bucket.

What do we need to do here?

tranvinhcuong · July 5, 2021, 1:13am

Hi @quartermaine, welcome to the course!

When training a model, you need somewhere to store the saved model and its intermediate results (checkpoints). In this lab that place is a Cloud Storage bucket. To use it, put the name of the bucket you created in the previous steps into both arguments.

Hope it helps,
Cuong
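
As a rough illustration of the substitution described above, the sketch below fills a bucket placeholder in the two arguments. The token BUCKET, the project ID my-project, and the paths after gs:// are hypothetical and not copied from the lab's tfjob.yaml; only the [YOUR_PROJECT_ID]-bucket naming pattern comes from the lab text.

```python
# Hypothetical placeholder; the real token in tfjob.yaml may differ.
BUCKET_TOKEN = "BUCKET"


def bucket_name(project_id: str) -> str:
    """Build the bucket name using the lab's [YOUR_PROJECT_ID]-bucket pattern."""
    return f"{project_id}-bucket"


def fill_bucket(arg: str, project_id: str) -> str:
    """Replace the placeholder token in a single argument string."""
    return arg.replace(BUCKET_TOKEN, bucket_name(project_id))


# Illustrative argument values, not taken from the lab manifest.
args = [
    "--saved_model_path=gs://BUCKET/saved_model_dir",
    "--checkpoint_path=gs://BUCKET/checkpoints",
]

# "my-project" is an assumed project ID used only for this example.
updated = [fill_bucket(a, "my-project") for a in args]
for line in updated:
    print(line)
```

In the lab itself the same change is made by hand in the editor; the point is only that both arguments must end up pointing at the same existing bucket.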

quartermaine · July 5, 2021, 9:52am

Hi @tranvinhcuong,
Thank you for the information. I have edited the tfjob.yaml file with the storage bucket, but when I run the command kubectl logs --follow ${JOB_NAME}-worker-0 I get:

Error from server (BadRequest): container "tensorflow" in pod "multi-worker-worker-0" is waiting to start: trying and failing to pull image

tranvinhcuong · July 5, 2021, 10:36am

Hi @quartermaine,
The error message says something is wrong with the Docker image. Can you check that the image has the correct tag?
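
One way to double-check the tag is to split the image reference from tfjob.yaml into its name and tag and compare the tag with the one you actually built and pushed. This is a minimal sketch; the image reference shown is a made-up example, not the one from the lab.

```python
from typing import Optional, Tuple


def split_image(ref: str) -> Tuple[str, Optional[str]]:
    """Split an image reference into (name, tag); tag is None when absent."""
    # Drop a digest suffix such as @sha256:... if one is present.
    ref = ref.split("@", 1)[0]
    # Only a colon in the last path segment marks a tag; an earlier
    # colon would belong to a registry port such as localhost:5000.
    last_segment = ref.rsplit("/", 1)[-1]
    if ":" in last_segment:
        name, tag = ref.rsplit(":", 1)
        return name, tag
    return ref, None


# Hypothetical image reference, for illustration only.
name, tag = split_image("gcr.io/my-project/mnist-train:v1")
print(name, tag)
```

Running kubectl describe pod on the failing pod also lists the pull events, including the exact image reference Kubernetes tried to fetch.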

quartermaine · July 6, 2021, 8:24am

Hi @tranvinhcuong,
I was able to pass the lab. As you suggested, the image tag was indeed wrong.

qchaldemer · July 8, 2021, 2:51am

Where do you find the tfjob.yaml file? Thanks.

quartermaine · July 8, 2021, 8:31am

Hi @qchaldemer ,

In Cloud Shell, click the Open Editor button and you will see a list of files. The tfjob.yaml file is located in the lab-files folder, and you can edit it there.
