Error on kubectl logs --follow ${JOB_NAME}-worker-0

2021-07-14 17:07:15.022011: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-07-14 17:07:15.022052: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:tensorflow:From /mnist/main.py:58: _CollectiveAllReduceStrategyExperimental.__init__ (from tensorflow.python.distribute.collective_all_reduce_strategy) is deprecated and will be removed in a future version.
Instructions for updating:
use distribute.MultiWorkerMirroredStrategy instead
2021-07-14 17:07:16.916991: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-07-14 17:07:16.917292: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-07-14 17:07:16.917326: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-07-14 17:07:16.917351: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (multi-worker-worker-0): /proc/driver/nvidia/version does not exist
2021-07-14 17:07:16.918143: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-14 17:07:16.918661: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-07-14 17:07:16.919226: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-07-14 17:07:16.924143: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> multi-worker-worker-0.default.svc:2222, 1 -> multi-worker-worker-1.default.svc:2222, 2 -> multi-worker-worker-2.default.svc:2222}
2021-07-14 17:07:16.924611: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://multi-worker-worker-0.default.svc:2222
INFO:tensorflow:Enabled multi-worker collective ops with available devices: ['/job:worker/replica:0/task:0/device:CPU:0']
INFO:tensorflow:Using MirroredStrategy with devices ('/job:worker/task:0',)
INFO:tensorflow:Waiting for the cluster, timeout = inf
INFO:tensorflow:Cluster is ready.
INFO:tensorflow:MultiWorkerMirroredStrategy with cluster_spec = {'worker': ['multi-worker-worker-0.default.svc:2222', 'multi-worker-worker-1.default.svc:2222', 'multi-worker-worker-2.default.svc:2222']}, task_type = 'worker', task_id = 0, num_workers = 3, local_devices = ('/job:worker/task:0',), communication = CommunicationImplementation.AUTO
INFO:absl:Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: mnist/3.0.1
INFO:absl:Load dataset info from /tmp/tmpxljshpm_tfds
INFO:absl:Field info.citation from disk and from code do not match. Keeping the one from code.
INFO:absl:Field info.splits from disk and from code do not match. Keeping the one from code.
INFO:absl:Field info.module_name from disk and from code do not match. Keeping the one from code.
INFO:absl:Generating dataset mnist (/root/tensorflow_datasets/mnist/3.0.1)
INFO:absl:Dataset mnist is hosted on GCS. It will automatically be downloaded to your
local data directory. If you'd instead prefer to read directly from our public
GCS bucket (recommended if you're running on GCP), you can instead pass
try_gcs=True to tfds.load or set data_dir=gs://tfds-data/datasets.

INFO:absl:Load dataset info from /root/tensorflow_datasets/mnist/3.0.1.incompleteTCN74M
INFO:absl:Field info.citation from disk and from code do not match. Keeping the one from code.
INFO:absl:Field info.splits from disk and from code do not match. Keeping the one from code.
INFO:absl:Field info.module_name from disk and from code do not match. Keeping the one from code.
INFO:absl:Constructing tf.data.Dataset mnist for split None, from /root/tensorflow_datasets/mnist/3.0.1
2021-07-14 17:07:22.915871: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-07-14 17:07:22.916398: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2299995000 Hz
INFO:tensorflow:Collective batch_all_reduce: 1 all-reduces, num_devices = 1, group_size = 3, implementation = AUTO, num_packs = 1
INFO:tensorflow:Collective batch_all_reduce: 1 all-reduces, num_devices = 1, group_size = 3, implementation = AUTO, num_packs = 1
WARNING:tensorflow:/job:worker/replica:0/task:1 seems down, retrying 1/3
WARNING:tensorflow:/job:worker/replica:0/task:1 seems down, retrying 1/3
WARNING:tensorflow:/job:worker/replica:0/task:1 seems down, retrying 2/3
WARNING:tensorflow:/job:worker/replica:0/task:1 seems down, retrying 2/3
ERROR:tensorflow:Cluster check alive failed, /job:worker/replica:0/task:1 is down, aborting collectives: Deadline Exceeded

Hi Lucas! Welcome to Discourse! Were the pods already marked as Running in the previous section, before you ran this command? If not, please check whether you've updated the image and args entries in the tfjob.yaml file; that is usually the problem here. You can use the Cloud Shell Editor to navigate to the file and edit it again if needed. Hope this helps!
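For reference, the two entries mentioned above live under spec.tfReplicaSpecs.Worker in the TFJob manifest. Below is a rough sketch of what tfjob.yaml should look like for the three-worker job in your log; the image tag, bucket paths, and training flags are placeholders rather than the lab's exact values, so substitute the ones from your own build and lab instructions:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: multi-worker            # matches the multi-worker-worker-N pod names in your log
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 3               # worker-0, worker-1, worker-2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: tensorflow    # TFJob expects this container name
            # Placeholder: point this at the image you built and pushed earlier in the lab.
            image: gcr.io/<your-project-id>/mnist-train:v1
            # Placeholder flags for illustration only; use the args from the lab instructions.
            args:
            - --epochs=5
            - --steps_per_epoch=100
            - --per_worker_batch=64
            - --saved_model_path=gs://<your-bucket>/saved_model
            - --checkpoint_path=gs://<your-bucket>/checkpoints
```

If you do edit the file, delete the old job and re-apply it (kubectl delete -f tfjob.yaml, then kubectl apply -f tfjob.yaml) so all three worker pods are recreated from the updated spec, and wait until kubectl get pods shows every worker as Running before following the logs again.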