Hi Chris,
Thanks for your reply. After training the model for 5 epochs, I get an error that I haven't been able to understand. Could you please take a look at it?
Epoch 5/5
100/100 [==============================] - 11s 115ms/step - loss: 1.7060 - accuracy: 0.6954
INFO:root:Saving the trained model to: gs://klabs-gcp-02-a42a054c4a6c-bucket/saved_model_dir
2022-08-27 19:05:53.234405: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnist/main.py", line 116, in <module>
    args.checkpoint_path, args.saved_model_path)
  File "/mnist/main.py", line 84, in train
    multi_worker_model.save(saved_model_dir)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py", line 2002, in save
    signatures, options, save_traces)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/saving/save.py", line 157, in save_model
    signatures, options, save_traces)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/saving/saved_model/save.py", line 89, in save
    save_lib.save(model, filepath, signatures, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/save.py", line 1038, in save
    utils_impl.get_or_create_variables_dir(export_dir)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/utils_impl.py", line 220, in get_or_create_variables_dir
    file_io.recursive_create_dir(variables_dir)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/lib/io/file_io.py", line 468, in recursive_create_dir
    recursive_create_dir_v2(dirname)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/lib/io/file_io.py", line 483, in recursive_create_dir_v2
    _pywrap_file_io.RecursivelyCreateDir(compat.path_to_bytes(path))
tensorflow.python.framework.errors_impl.InvalidArgumentError: 'object' must be a non-empty string. (File: gs://klabs-gcp-02-a42a054c4a6c-bucket/)
Events:
  Type    Reason                   Age    From         Message
  Normal  SuccessfulCreatePod      5m15s  tf-operator  Created pod: multi-worker-worker-0
  Normal  SuccessfulCreatePod      5m15s  tf-operator  Created pod: multi-worker-worker-1
  Normal  SuccessfulCreatePod      5m15s  tf-operator  Created pod: multi-worker-worker-2
  Normal  SuccessfulCreateService  5m14s  tf-operator  Created service: multi-worker-worker-0
  Normal  SuccessfulCreateService  5m14s  tf-operator  Created service: multi-worker-worker-1
  Normal  SuccessfulCreateService  5m14s  tf-operator  Created service: multi-worker-worker-2
  Normal  ExitedWithCode           3m28s  tf-operator  Pod: default.multi-worker-worker-0 exited with code 1
  Normal  TFJobFailed              3m28s  tf-operator  TFJob multi-worker has failed because 1 Worker replica(s) failed.
Thanks,