Issue:
I'm having an issue running the MLEP Week 3, Lab 3 assignment, "Distributed Multi-worker TensorFlow Training on Kubernetes" (GSP775).
The workers appear to start, but worker-1 exits at the start of processing with this error:
Pod: default.multi-worker-worker-1 exited with code 1
Request:
Please advise how to resolve this issue.
Background:
Running the lab in a Chrome Incognito window on a MacBook Air (M1), macOS 12.3.1.
The tfjob.yaml file was edited using vim.
Discussion:
I encountered this same error when running the lab yesterday. I reported the issue, and Qwiklabs restored my lab access. Running the lab again today yielded the same result. I believe that I edited the tfjob.yaml file correctly.
Below are excerpts from the worker-1 log. The log complains about a cuInit error, but a web search suggests that this is not related to the issue I'm encountering.
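For reference, the excerpts below were pulled from the worker-1 pod with roughly this command (the pod name and the default namespace come from the TFJob events shown further down):
kubectl logs multi-worker-worker-1 -n default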
##
##-Log entry is:
Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-05-16 18:57:26.974326: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-05-16 18:57:26.974356: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
##
##-The log later shows XLA complaints, and then the job starts.
##-Then there are complaints that "citation from disk and from code do not match".
##-Then the job fails.
2022-05-16 18:57:26.976158: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-05-16 18:57:26.988537: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> multi-worker-worker-0.default.svc:2222, 1 -> multi-worker-worker-1.default.svc:2222, 2 -> multi-worker-worker-2.default.svc:2222}
2022-05-16 18:57:26.989088: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://multi-worker-worker-1.default.svc:2222
INFO:tensorflow:Enabled multi-worker collective ops with available devices: ['/job:worker/replica:0/task:1/device:CPU:0']
INFO:tensorflow:Using MirroredStrategy with devices ('/job:worker/task:1',)
INFO:tensorflow:Waiting for the cluster, timeout = inf
INFO:tensorflow:Cluster is ready.
INFO:tensorflow:MultiWorkerMirroredStrategy with cluster_spec = {'worker': ['multi-worker-worker-0.default.svc:2222', 'multi-worker-worker-1.default.svc:2222', 'multi-worker-worker-2.default.svc:2222']}, task_type = 'worker', task_id = 1, num_workers = 3, local_devices = ('/job:worker/task:1',), communication = CommunicationImplementation.AUTO
INFO:absl:Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: mnist/3.0.1
INFO:absl:Load dataset info from /tmp/tmpoov2q2bdtfds
INFO:absl:Field info.citation from disk and from code do not match. Keeping the one from code.
INFO:absl:Field info.splits from disk and from code do not match. Keeping the one from code.
INFO:absl:Field info.supervised_keys from disk and from code do not match. Keeping the one from code.
INFO:absl:Field info.module_name from disk and from code do not match. Keeping the one from code.
INFO:absl:Generating dataset mnist (/root/tensorflow_datasets/mnist/3.0.1)
INFO:absl:Dataset mnist is hosted on GCS. It will automatically be downloaded to your local data directory. If you'd instead prefer to read directly from our public GCS bucket (recommended if you're running on GCP), you can instead pass try_gcs=True to tfds.load or set data_dir=gs://tfds-data/datasets.
INFO:absl:Load dataset info from /root/tensorflow_datasets/mnist/3.0.1.incomplete4BGB9B
INFO:absl:Field info.citation from disk and from code do not match. Keeping the one from code.
##
##-The tfjob.yaml file contents are:
student_01_80a7d583abb6@cloudshell:~/lab-files (qwiklabs-gcp-02-747e5620eebe)$ more tfjob.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata: # kpt-merge: /multi-worker
  name: multi-worker
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: tensorflow
              image: Google Cloud Platform
              args:
                - --epochs=5
                - --steps_per_epoch=100
                - --per_worker_batch=64
                - --saved_model_path=gs://gcr.io/qwiklabs-gcp-02-747e5620eebe/bucket/saved_model_dir
                - --checkpoint_path=gs://gcr.io/qwiklabs-gcp-02-747e5620eebe/bucket/checkpoints
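##
##-For my own cross-check, this is the general shape I expected the container spec to take; the image and bucket names below are placeholders I made up, not the lab's actual values:
          containers:
            - name: tensorflow
              image: gcr.io/MY_PROJECT_ID/MY_TRAINING_IMAGE   # placeholder: the training image built and pushed earlier in the lab
              args:
                - --epochs=5
                - --steps_per_epoch=100
                - --per_worker_batch=64
                - --saved_model_path=gs://MY_BUCKET_NAME/saved_model_dir   # placeholder Cloud Storage path
                - --checkpoint_path=gs://MY_BUCKET_NAME/checkpoints        # placeholder Cloud Storage path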
##
##-TFJob successfully created
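##-The status snapshots below were captured with roughly this command as the job progressed:
kubectl describe tfjob multi-worker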
Status:
  Conditions:
    Last Transition Time:  2022-05-16T18:57:00Z
    Last Update Time:      2022-05-16T18:57:00Z
    Message:               TFJob multi-worker is created.
    Reason:                TFJobCreated
    Status:                True
    Type:                  Created
  Replica Statuses:
    Worker:
  Start Time:  2022-05-16T18:57:00Z
Events:
  Type    Reason                    Age  From         Message
  Normal  SuccessfulCreatePod       14s  tf-operator  Created pod: multi-worker-worker-0
  Normal  SuccessfulCreatePod       14s  tf-operator  Created pod: multi-worker-worker-1
  Normal  SuccessfulCreatePod       14s  tf-operator  Created pod: multi-worker-worker-2
  Normal  SuccessfulCreateService   14s  tf-operator  Created service: multi-worker-worker-0
  Normal  SuccessfulCreateService   14s  tf-operator  Created service: multi-worker-worker-1
  Normal  SuccessfulCreateService   14s  tf-operator  Created service: multi-worker-worker-2
##
##-TFJob status switches to Running
Status:
  Conditions:
    Last Transition Time:  2022-05-16T18:57:00Z
    Last Update Time:      2022-05-16T18:57:00Z
    Message:               TFJob multi-worker is created.
    Reason:                TFJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2022-05-16T18:57:23Z
    Last Update Time:      2022-05-16T18:57:23Z
    Message:               TFJob multi-worker is running.
    Reason:                TFJobRunning
    Status:                True
    Type:                  Running
  Replica Statuses:
    Worker:
      Active:  3
  Start Time:  2022-05-16T18:57:00Z
Events:
  Type    Reason                    Age  From         Message
  Normal  SuccessfulCreatePod       29s  tf-operator  Created pod: multi-worker-worker-0
  Normal  SuccessfulCreatePod       29s  tf-operator  Created pod: multi-worker-worker-1
  Normal  SuccessfulCreatePod       29s  tf-operator  Created pod: multi-worker-worker-2
  Normal  SuccessfulCreateService   29s  tf-operator  Created service: multi-worker-worker-0
  Normal  SuccessfulCreateService   29s  tf-operator  Created service: multi-worker-worker-1
  Normal  SuccessfulCreateService   29s  tf-operator  Created service: multi-worker-worker-2
##
##-TFJob status switches to Failed because 1 Worker replica failed
Status:
  Completion Time:  2022-05-16T18:57:41Z
  Conditions:
    Last Transition Time:  2022-05-16T18:57:00Z
    Last Update Time:      2022-05-16T18:57:00Z
    Message:               TFJob multi-worker is created.
    Reason:                TFJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2022-05-16T18:57:23Z
    Last Update Time:      2022-05-16T18:57:23Z
    Message:               TFJob multi-worker is running.
    Reason:                TFJobRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2022-05-16T18:57:41Z
    Last Update Time:      2022-05-16T18:57:41Z
    Message:               TFJob multi-worker has failed because 1 Worker replica(s) failed.
    Reason:                TFJobFailed
    Status:                True
    Type:                  Failed
  Replica Statuses:
    Worker:
      Active:  2
      Failed:  1
  Start Time:  2022-05-16T18:57:00Z
Events:
  Type    Reason                    Age  From         Message
  Normal  SuccessfulCreatePod       67s  tf-operator  Created pod: multi-worker-worker-0
  Normal  SuccessfulCreatePod       67s  tf-operator  Created pod: multi-worker-worker-1
  Normal  SuccessfulCreatePod       67s  tf-operator  Created pod: multi-worker-worker-2
  Normal  SuccessfulCreateService   67s  tf-operator  Created service: multi-worker-worker-0
  Normal  SuccessfulCreateService   67s  tf-operator  Created service: multi-worker-worker-1
  Normal  SuccessfulCreateService   67s  tf-operator  Created service: multi-worker-worker-2
  Normal  ExitedWithCode            26s  tf-operator  Pod: default.multi-worker-worker-1 exited with code 1
  Normal  TFJobFailed               26s  tf-operator  TFJob multi-worker has failed because 1 Worker replica(s) failed.