Week 3 Assignment - High performance modelling

For the Week 3 assignment on “Distributed Multi-worker TensorFlow Training on Kubernetes”, I followed all the instructions up to step 6,
but the status of the pods never changes to Running.

Could the instructors please look into it and advise how to debug/fix it?

NAME                    READY   STATUS         RESTARTS   AGE
multi-worker-worker-0   0/1     ErrImagePull   0          35s
multi-worker-worker-1   0/1     ErrImagePull   0          35s
multi-worker-worker-2   0/1     ErrImagePull   0          35s

student_01_f7a7b79139ea@cloudshell:~/lab-files (qwiklabs-gcp-03-0a2103eb388a)$ kubectl get pods
NAME                    READY   STATUS             RESTARTS   AGE
multi-worker-worker-0   0/1     ImagePullBackOff   0          44s
multi-worker-worker-1   0/1     ImagePullBackOff   0          44s
multi-worker-worker-2   0/1     ImagePullBackOff   0          44s

Hi Manoj! It’s possible that your image is not yet deployed or the tfjob.yaml is not yet configured correctly. Please see this thread for tips. Hope it helps!
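If it helps, here are a few commands that usually narrow down ErrImagePull / ImagePullBackOff. The pod and job names are taken from the output above; `$PROJECT_ID` is a placeholder for your own project:

```shell
# Show the Events section with the exact image-pull error message
kubectl describe pod multi-worker-worker-0

# Confirm the image reference in the TFJob matches what you actually built and pushed
kubectl get tfjob multi-worker -o yaml | grep image:

# Verify the image exists in your project's registry
gcloud container images list --repository=gcr.io/$PROJECT_ID
```

The `describe pod` events will tell you whether the problem is a typo in the image name, a missing tag, or a registry-permission issue.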

Hi Chris,

Thanks for your reply. After training the model for 5 epochs, I see an error, but I haven’t been able to understand it. Can you please take a look?

Epoch 5/5
100/100 [==============================] - 11s 115ms/step - loss: 1.7060 - accuracy: 0.6954
INFO:root:Saving the trained model to: gs://klabs-gcp-02-a42a054c4a6c-bucket/saved_model_dir
2022-08-27 19:05:53.234405: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnist/main.py", line 116, in <module>
    args.checkpoint_path, args.saved_model_path)
  File "/mnist/main.py", line 84, in train
    multi_worker_model.save(saved_model_dir)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 2002, in save
    signatures, options, save_traces)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/saving/save.py", line 157, in save_model
    signatures, options, save_traces)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/saving/saved_model/save.py", line 89, in save
    save_lib.save(model, filepath, signatures, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/save.py", line 1038, in save
    utils_impl.get_or_create_variables_dir(export_dir)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/utils_impl.py", line 220, in get_or_create_variables_dir
    file_io.recursive_create_dir(variables_dir)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/lib/io/file_io.py", line 468, in recursive_create_dir
    recursive_create_dir_v2(dirname)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/lib/io/file_io.py", line 483, in recursive_create_dir_v2
    _pywrap_file_io.RecursivelyCreateDir(compat.path_to_bytes(path))
tensorflow.python.framework.errors_impl.InvalidArgumentError: 'object' must be a non-empty string. (File: gs://klabs-gcp-02-a42a054c4a6c-bucket/)

Events:
  Type    Reason                   Age    From         Message
  ----    ------                   ----   ----         -------
  Normal  SuccessfulCreatePod      5m15s  tf-operator  Created pod: multi-worker-worker-0
  Normal  SuccessfulCreatePod      5m15s  tf-operator  Created pod: multi-worker-worker-1
  Normal  SuccessfulCreatePod      5m15s  tf-operator  Created pod: multi-worker-worker-2
  Normal  SuccessfulCreateService  5m14s  tf-operator  Created service: multi-worker-worker-0
  Normal  SuccessfulCreateService  5m14s  tf-operator  Created service: multi-worker-worker-1
  Normal  SuccessfulCreateService  5m14s  tf-operator  Created service: multi-worker-worker-2
  Normal  ExitedWithCode           3m28s  tf-operator  Pod: default.multi-worker-worker-0 exited with code 1
  Normal  TFJobFailed              3m28s  tf-operator  TFJob multi-worker has failed because 1 Worker replica(s) failed.

Thanks,

Please take a look at my reply in the thread “Error in upgrading TFjob Manifest” - #15 by HarryXPan. It could be a bucket name issue.
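For anyone hitting the same `'object' must be a non-empty string` error: note that the failing path in the traceback is just the bucket root (`gs://…-bucket/`) with nothing after the slash. One way that can happen is when the saved-model directory name resolves to an empty string, e.g. an unset variable substituted into the manifest. A minimal sketch (the helper below is hypothetical, not the lab's actual code):

```python
import posixpath

def saved_model_path(bucket: str, model_dir: str) -> str:
    # GCS object paths join like POSIX paths. If model_dir is empty
    # (e.g. an unset env var expanded in tfjob.yaml), the result is the
    # bare bucket root, and GCS rejects the empty object name with
    # InvalidArgumentError: 'object' must be a non-empty string.
    return posixpath.join(f"gs://{bucket}", model_dir)

print(saved_model_path("my-bucket", "saved_model_dir"))  # gs://my-bucket/saved_model_dir
print(saved_model_path("my-bucket", ""))                 # gs://my-bucket/  <- bucket root only
```

So before re-running the job, it's worth double-checking that the bucket and model-directory variables in tfjob.yaml expanded to the values you expect.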