Week 3 Assignment - High performance modelling

For the Week 3 assignment on “Distributed Multi-worker TensorFlow Training on Kubernetes”, I followed all the instructions up to step 6,
but the status of the pods never changes to Running.

Could the instructors please look into it and advise how to debug/fix it?

NAME                    READY   STATUS         RESTARTS   AGE
multi-worker-worker-0   0/1     ErrImagePull   0          35s
multi-worker-worker-1   0/1     ErrImagePull   0          35s
multi-worker-worker-2   0/1     ErrImagePull   0          35s

student_01_f7a7b79139ea@cloudshell:~/lab-files (qwiklabs-gcp-03-0a2103eb388a)$ kubectl get pods
NAME                    READY   STATUS             RESTARTS   AGE
multi-worker-worker-0   0/1     ImagePullBackOff   0          44s
multi-worker-worker-1   0/1     ImagePullBackOff   0          44s
multi-worker-worker-2   0/1     ImagePullBackOff   0          44s

Hi Manoj! It’s possible that your image is not yet deployed or the tfjob.yaml is not yet configured correctly. Please see this thread for tips. Hope it helps!
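If it helps, here are a few commands that usually narrow down ErrImagePull / ImagePullBackOff. The pod and job names are taken from the output above; `$PROJECT_ID` is a placeholder for your own project:

```shell
# Show the Events section with the exact image-pull error message
kubectl describe pod multi-worker-worker-0

# Confirm the image reference in the TFJob matches what you actually built and pushed
kubectl get tfjob multi-worker -o yaml | grep image:

# Verify the image exists in your project's registry
gcloud container images list --repository=gcr.io/$PROJECT_ID
```

The `describe pod` events will tell you whether the problem is a typo in the image name, a missing tag, or a registry-permission issue.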

Hi Chris,

Thanks for your reply. After training the model for 5 epochs, I see an error, but I haven’t been able to understand it. Can you please take a look?

Epoch 5/5
100/100 [==============================] - 11s 115ms/step - loss: 1.7060 - accuracy: 0.6954
INFO:root:Saving the trained model to: gs://klabs-gcp-02-a42a054c4a6c-bucket/saved_model_dir
2022-08-27 19:05:53.234405: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnist/main.py", line 116, in <module>
    args.checkpoint_path, args.saved_model_path)
  File "/mnist/main.py", line 84, in train
    multi_worker_model.save(saved_model_dir)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 2002, in save
    signatures, options, save_traces)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/saving/save.py", line 157, in save_model
    signatures, options, save_traces)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/saving/saved_model/save.py", line 89, in save
    save_lib.save(model, filepath, signatures, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/save.py", line 1038, in save
    utils_impl.get_or_create_variables_dir(export_dir)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/utils_impl.py", line 220, in get_or_create_variables_dir
    file_io.recursive_create_dir(variables_dir)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/lib/io/file_io.py", line 468, in recursive_create_dir
    recursive_create_dir_v2(dirname)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/lib/io/file_io.py", line 483, in recursive_create_dir_v2
    _pywrap_file_io.RecursivelyCreateDir(compat.path_to_bytes(path))
tensorflow.python.framework.errors_impl.InvalidArgumentError: 'object' must be a non-empty string. (File: gs://klabs-gcp-02-a42a054c4a6c-bucket/)

Events:
  Type    Reason                   Age    From         Message
  ----    ------                   ----   ----         -------
  Normal  SuccessfulCreatePod      5m15s  tf-operator  Created pod: multi-worker-worker-0
  Normal  SuccessfulCreatePod      5m15s  tf-operator  Created pod: multi-worker-worker-1
  Normal  SuccessfulCreatePod      5m15s  tf-operator  Created pod: multi-worker-worker-2
  Normal  SuccessfulCreateService  5m14s  tf-operator  Created service: multi-worker-worker-0
  Normal  SuccessfulCreateService  5m14s  tf-operator  Created service: multi-worker-worker-1
  Normal  SuccessfulCreateService  5m14s  tf-operator  Created service: multi-worker-worker-2
  Normal  ExitedWithCode           3m28s  tf-operator  Pod: default.multi-worker-worker-0 exited with code 1
  Normal  TFJobFailed              3m28s  tf-operator  TFJob multi-worker has failed because 1 Worker replica(s) failed.

Thanks,

Please take a look at my reply in the thread “Error in upgrading TFjob Manifest” - #15 by HarryXPan. It could be a bucket name issue.
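For anyone hitting the same `'object' must be a non-empty string` error: note that the failing path in the traceback is just the bucket root (`gs://…-bucket/`) with nothing after the slash. One way that can happen is when the saved-model directory name resolves to an empty string, e.g. an unset variable substituted into the manifest. A minimal sketch (the helper below is hypothetical, not the lab's actual code):

```python
import posixpath

def saved_model_path(bucket: str, model_dir: str) -> str:
    # GCS object paths join like POSIX paths. If model_dir is empty
    # (e.g. an unset env var expanded in tfjob.yaml), the result is the
    # bare bucket root, and GCS rejects the empty object name with
    # InvalidArgumentError: 'object' must be a non-empty string.
    return posixpath.join(f"gs://{bucket}", model_dir)

print(saved_model_path("my-bucket", "saved_model_dir"))  # gs://my-bucket/saved_model_dir
print(saved_model_path("my-bucket", ""))                 # gs://my-bucket/  <- bucket root only
```

So before re-running the job, it's worth double-checking that the bucket and model-directory variables in tfjob.yaml expanded to the values you expect.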