MLEP Week 3, Lab 3: worker-1 exited with code 1

Issue:
I’m having an issue running the MLEP Week 3, Lab 3 assignment, "Distributed Multi-worker TensorFlow Training on Kubernetes" (GSP775).
The workers appear to start, but worker-1 exits at the start of processing with this error:
Pod: default.multi-worker-worker-1 exited with code 1

Request:
Please advise how to resolve this issue.

Background:
Running the lab in a Chrome Incognito window on a MacBook Air (M1), macOS 12.3.1.
The tfjob.yaml file was edited with vim.

Discussion:
I encountered this same error running the lab yesterday. I reported the issue and Qwiklabs restored my lab access, but running the lab again today yielded the same result. I believe I edited the tfjob.yaml file correctly.
Below are excerpts from the worker-1 log. The log complains about a cuInit error, but a web search suggests that warning is unrelated to the issue I’m encountering.
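(For reference, the per-worker excerpts below were pulled with an ordinary kubectl logs call against the failed pod named in the error above, along the lines of:

kubectl logs multi-worker-worker-1

Nothing lab-specific there; it just dumps the container's stdout/stderr.)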

##
##-Log entry is:
Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-05-16 18:57:26.974326: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-05-16 18:57:26.974356: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)

##
##-The log later shows XLA complaints and then the job starts.
##-Then complaints that "citation from disk and from code do not match"
##-Then the job fails
2022-05-16 18:57:26.976158: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-05-16 18:57:26.988537: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:301] Initialize GrpcChannelCache for job worker -> {0 -> multi-worker-worker-0.default.svc:2222, 1 -> multi-worker-worker-1.default.svc:2222, 2 -> multi-worker-worker-2.default.svc:2222}
2022-05-16 18:57:26.989088: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://multi-worker-worker-1.default.svc:2222
INFO:tensorflow:Enabled multi-worker collective ops with available devices: ['/job:worker/replica:0/task:1/device:CPU:0']
INFO:tensorflow:Using MirroredStrategy with devices ('/job:worker/task:1',)
INFO:tensorflow:Waiting for the cluster, timeout = inf
INFO:tensorflow:Cluster is ready.
INFO:tensorflow:MultiWorkerMirroredStrategy with cluster_spec = {'worker': ['multi-worker-worker-0.default.svc:2222', 'multi-worker-worker-1.default.svc:2222', 'multi-worker-worker-2.default.svc:2222']}, task_type = 'worker', task_id = 1, num_workers = 3, local_devices = ('/job:worker/task:1',), communication = CommunicationImplementation.AUTO
INFO:absl:Load pre-computed DatasetInfo (eg: splits, num examples,…) from GCS: mnist/3.0.1
INFO:absl:Load dataset info from /tmp/tmpoov2q2bdtfds
INFO:absl:Field info.citation from disk and from code do not match. Keeping the one from code.
INFO:absl:Field info.splits from disk and from code do not match. Keeping the one from code.
INFO:absl:Field info.supervised_keys from disk and from code do not match. Keeping the one from code.
INFO:absl:Field info.module_name from disk and from code do not match. Keeping the one from code.
INFO:absl:Generating dataset mnist (/root/tensorflow_datasets/mnist/3.0.1)
INFO:absl:Dataset mnist is hosted on GCS. It will automatically be downloaded to your
local data directory. If you'd instead prefer to read directly from our public
GCS bucket (recommended if you’re running on GCP), you can instead pass
try_gcs=True to tfds.load or set data_dir=gs://tfds-data/datasets.

INFO:absl:Load dataset info from /root/tensorflow_datasets/mnist/3.0.1.incomplete4BGB9B
INFO:absl:Field info.citation from disk and from code do not match. Keeping the one from code.

##
##-The tfjob.yaml file contents are:
student_01_80a7d583abb6@cloudshell:~/lab-files (qwiklabs-gcp-02-747e5620eebe)$ more tfjob.yaml

apiVersion: kubeflow.org/v1
kind: TFJob
metadata: # kpt-merge: /multi-worker
  name: multi-worker
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: tensorflow
            image: Google Cloud Platform
            args:
            - --epochs=5
            - --steps_per_epoch=100
            - --per_worker_batch=64
            - --saved_model_path=gs://gcr.io/qwiklabs-gcp-02-747e5620eebe/bucket/saved_model_dir
            - --checkpoint_path=gs://gcr.io/qwiklabs-gcp-02-747e5620eebe/bucket/checkpoints

##
##-TFJob successfully created

Status:
Conditions:
Last Transition Time: 2022-05-16T18:57:00Z
Last Update Time: 2022-05-16T18:57:00Z
Message: TFJob multi-worker is created.
Reason: TFJobCreated
Status: True
Type: Created
Replica Statuses:
Worker:
Start Time: 2022-05-16T18:57:00Z
Events:
Type Reason Age From Message


Normal SuccessfulCreatePod 14s tf-operator Created pod: multi-worker-worker-0
Normal SuccessfulCreatePod 14s tf-operator Created pod: multi-worker-worker-1
Normal SuccessfulCreatePod 14s tf-operator Created pod: multi-worker-worker-2
Normal SuccessfulCreateService 14s tf-operator Created service: multi-worker-worker-0
Normal SuccessfulCreateService 14s tf-operator Created service: multi-worker-worker-1
Normal SuccessfulCreateService 14s tf-operator Created service: multi-worker-worker-2

##
##-TFJob status switches to Running

Status:
Conditions:
Last Transition Time: 2022-05-16T18:57:00Z
Last Update Time: 2022-05-16T18:57:00Z
Message: TFJob multi-worker is created.
Reason: TFJobCreated
Status: True
Type: Created
Last Transition Time: 2022-05-16T18:57:23Z
Last Update Time: 2022-05-16T18:57:23Z
Message: TFJob multi-worker is running.
Reason: TFJobRunning
Status: True
Type: Running
Replica Statuses:
Worker:
Active: 3
Start Time: 2022-05-16T18:57:00Z
Events:
Type Reason Age From Message


Normal SuccessfulCreatePod 29s tf-operator Created pod: multi-worker-worker-0
Normal SuccessfulCreatePod 29s tf-operator Created pod: multi-worker-worker-1
Normal SuccessfulCreatePod 29s tf-operator Created pod: multi-worker-worker-2
Normal SuccessfulCreateService 29s tf-operator Created service: multi-worker-worker-0
Normal SuccessfulCreateService 29s tf-operator Created service: multi-worker-worker-1
Normal SuccessfulCreateService 29s tf-operator Created service: multi-worker-worker-2

##
##-TFJob status switches to Failed due to worker not available

Status:
Completion Time: 2022-05-16T18:57:41Z
Conditions:
Last Transition Time: 2022-05-16T18:57:00Z
Last Update Time: 2022-05-16T18:57:00Z
Message: TFJob multi-worker is created.
Reason: TFJobCreated
Status: True
Type: Created
Last Transition Time: 2022-05-16T18:57:23Z
Last Update Time: 2022-05-16T18:57:23Z
Message: TFJob multi-worker is running.
Reason: TFJobRunning
Status: False
Type: Running
Last Transition Time: 2022-05-16T18:57:41Z
Last Update Time: 2022-05-16T18:57:41Z
Message: TFJob multi-worker has failed because 1 Worker replica(s) failed.
Reason: TFJobFailed
Status: True
Type: Failed
Replica Statuses:
Worker:
Active: 2
Failed: 1
Start Time: 2022-05-16T18:57:00Z
Events:
Type Reason Age From Message


Normal SuccessfulCreatePod 67s tf-operator Created pod: multi-worker-worker-0
Normal SuccessfulCreatePod 67s tf-operator Created pod: multi-worker-worker-1
Normal SuccessfulCreatePod 67s tf-operator Created pod: multi-worker-worker-2
Normal SuccessfulCreateService 67s tf-operator Created service: multi-worker-worker-0
Normal SuccessfulCreateService 67s tf-operator Created service: multi-worker-worker-1
Normal SuccessfulCreateService 67s tf-operator Created service: multi-worker-worker-2
Normal ExitedWithCode 26s tf-operator Pod: default.multi-worker-worker-1 exited with code 1
Normal TFJobFailed 26s tf-operator TFJob multi-worker has failed because 1 Worker replica(s) failed.

Hi Dennis! Welcome to Discourse! I think the error stems from the --saved_model_path and --checkpoint_path lines of your tfjob.yaml:

The project ID should be followed by -bucket instead of /bucket. Kindly revise both lines. Hope it works!
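To illustrate with a placeholder project ID (only the saved_model_path line is shown; the checkpoint_path line follows the same pattern):

gs://<project-id>-bucket/saved_model_dir   (bucket name = project ID with a -bucket suffix)
gs://<project-id>/bucket/saved_model_dir   (incorrect: /bucket as a separate path element)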

Hi Chris. Thanks for reviewing my issue. I still have the issue, although the mnist dataset is now logged as accessed/downloaded. However, the workers still fail shortly after switching to Running status. Suggestions appreciated. Details follow…

Issue: (from 16-May-2022)
I’m having an issue running the MLEP Week 3, Lab 3 assignment, "Distributed Multi-worker TensorFlow Training on Kubernetes" (GSP775).
The workers appear to start but exit at the start of processing with exit code 1.

Update: 17-May-2022
Re-ran lab GSP775 (in a Chrome Incognito window).
The tfjob.yaml corrections suggested by the mentor were applied.
Processing proceeded further (the mnist files were accessed), but the workers still failed.

Then re-ran lab GSP775 (in a standard Chrome window).
The tfjob.yaml corrections suggested by the mentor were applied again.
Processing proceeded further (the mnist files were accessed), but the workers still failed.

Request:
Suggestions for how to resolve this issue are appreciated.

Background:
Running the lab in Chrome Version 101.0.4951.54 (Official Build) (arm64) on a MacBook Air (M1), macOS 12.3.1.
The tfjob.yaml file was edited with vim.

Discussion:
The same error was encountered when running the lab on 16-May-2022.
Thank you, mentor, for identifying my error in editing the tfjob.yaml file.
The workers are now created and then start running.
The log file reports:
- Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /root/tensorflow_datasets/mnist/3.0.1…
- Dataset mnist downloaded and prepared to /root/tensorflow_datasets/mnist/3.0.1
- Epoch 1/5 starts

Then the workers fail, and I can't see why.
Below are excerpts from the worker logs.

##
##-TFJob status shows workers created:

student_01_31d37cf9bebf@cloudshell:~/lab-files (qwiklabs-gcp-00-125b26431461)$ JOB_NAME=multi-worker
kubectl describe tfjob $JOB_NAME
Name: multi-worker
Namespace: default
Labels:
Annotations:
API Version: kubeflow.org/v1
Kind: TFJob
Metadata:
Creation Timestamp: 2022-05-17T17:21:47Z
Generation: 1
Managed Fields:
API Version: kubeflow.org/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:kubectl.kubernetes.io/last-applied-configuration:
f:spec:
.:
f:cleanPodPolicy:
f:tfReplicaSpecs:
.:
f:Worker:
.:
f:replicas:
f:template:
.:
f:spec:
.:
f:containers:
Manager: kubectl-client-side-apply
Operation: Update
Time: 2022-05-17T17:21:47Z
API Version: kubeflow.org/v1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:conditions:
f:replicaStatuses:
.:
f:Worker:
f:startTime:
Manager: tf-operator.v1
Operation: Update
Time: 2022-05-17T17:21:48Z
Resource Version: 5996
UID: 4008a268-085b-46bb-b54d-40b470ff198c
Spec:
Clean Pod Policy: None
Tf Replica Specs:
Worker:
Replicas: 3
Template:
Spec:
Containers:
Args:
--epochs=5
--steps_per_epoch=100
--per_worker_batch=64
--saved_model_path=gs://gcr.io/qwiklabs-gcp-00-125b26431461-bucket/saved_model_dir
--checkpoint_path=gs://gcr.io/qwiklabs-gcp-00-125b26431461-bucket/checkpoints
Image: Google Cloud Platform
Name: tensorflow
Status:
Conditions:
Last Transition Time: 2022-05-17T17:21:47Z
Last Update Time: 2022-05-17T17:21:47Z
Message: TFJob multi-worker is created.
Reason: TFJobCreated
Status: True
Type: Created
Replica Statuses:
Worker:
Start Time: 2022-05-17T17:21:47Z
Events:
Type Reason Age From Message


Normal SuccessfulCreatePod 17s tf-operator Created pod: multi-worker-worker-0
Normal SuccessfulCreatePod 17s tf-operator Created pod: multi-worker-worker-1
Normal SuccessfulCreatePod 17s tf-operator Created pod: multi-worker-worker-2
Normal SuccessfulCreateService 17s tf-operator Created service: multi-worker-worker-0
Normal SuccessfulCreateService 17s tf-operator Created service: multi-worker-worker-1
Normal SuccessfulCreateService 16s tf-operator Created service: multi-worker-worker-2

##
##-TFJob status switches to Running

Status:
Conditions:
Last Transition Time: 2022-05-17T17:21:47Z
Last Update Time: 2022-05-17T17:21:47Z
Message: TFJob multi-worker is created.
Reason: TFJobCreated
Status: True
Type: Created
Last Transition Time: 2022-05-17T17:22:22Z
Last Update Time: 2022-05-17T17:22:22Z
Message: TFJob multi-worker is running.
Reason: TFJobRunning
Status: True
Type: Running
Replica Statuses:
Worker:
Active: 3
Start Time: 2022-05-17T17:21:47Z
Events:
Type Reason Age From Message


Normal SuccessfulCreatePod 43s tf-operator Created pod: multi-worker-worker-0
Normal SuccessfulCreatePod 43s tf-operator Created pod: multi-worker-worker-1
Normal SuccessfulCreatePod 43s tf-operator Created pod: multi-worker-worker-2
Normal SuccessfulCreateService 43s tf-operator Created service: multi-worker-worker-0
Normal SuccessfulCreateService 43s tf-operator Created service: multi-worker-worker-1
Normal SuccessfulCreateService 42s tf-operator Created service: multi-worker-worker-2

##
##-TFJob status shows failed

Status:
Completion Time: 2022-05-17T17:22:30Z
Conditions:
Last Transition Time: 2022-05-17T17:21:47Z
Last Update Time: 2022-05-17T17:21:47Z
Message: TFJob multi-worker is created.
Reason: TFJobCreated
Status: True
Type: Created
Last Transition Time: 2022-05-17T17:22:22Z
Last Update Time: 2022-05-17T17:22:22Z
Message: TFJob multi-worker is running.
Reason: TFJobRunning
Status: False
Type: Running
Last Transition Time: 2022-05-17T17:22:30Z
Last Update Time: 2022-05-17T17:22:30Z
Message: TFJob multi-worker has failed because 1 Worker replica(s) failed.
Reason: TFJobFailed
Status: True
Type: Failed
Replica Statuses:
Worker:
Active: 2
Failed: 1
Start Time: 2022-05-17T17:21:47Z
Events:
Type Reason Age From Message


Normal SuccessfulCreatePod 50s tf-operator Created pod: multi-worker-worker-0
Normal SuccessfulCreatePod 50s tf-operator Created pod: multi-worker-worker-1
Normal SuccessfulCreatePod 50s tf-operator Created pod: multi-worker-worker-2
Normal SuccessfulCreateService 50s tf-operator Created service: multi-worker-worker-0
Normal SuccessfulCreateService 50s tf-operator Created service: multi-worker-worker-1
Normal SuccessfulCreateService 49s tf-operator Created service: multi-worker-worker-2
Normal ExitedWithCode 7s tf-operator Pod: default.multi-worker-worker-2 exited with code 1
Normal TFJobFailed 7s tf-operator TFJob multi-worker has failed because 1 Worker replica(s) failed.

##
##-TFJob log entries

2022-05-17 17:22:25.039027: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:411] Started server with target: grpc://multi-worker-worker-0.default.svc:2222
INFO:tensorflow:Enabled multi-worker collective ops with available devices: ['/job:worker/replica:0/task:0/device:CPU:0']
INFO:tensorflow:Using MirroredStrategy with devices ('/job:worker/task:0',)
INFO:tensorflow:Waiting for the cluster, timeout = inf
INFO:tensorflow:Cluster is ready.
INFO:tensorflow:MultiWorkerMirroredStrategy with cluster_spec = {'worker': ['multi-worker-worker-0.default.svc:2222', 'multi-worker-worker-1.default.svc:2222', 'multi-worker-worker-2.default.svc:2222']}, task_type = 'worker', task_id = 0, num_workers = 3, local_devices = ('/job:worker/task:0',), communication = CommunicationImplementation.AUTO
INFO:absl:Load pre-computed DatasetInfo (eg: splits, num examples,…) from GCS: mnist/3.0.1
INFO:absl:Load dataset info from /tmp/tmpgj68s6nltfds

2022-05-17 17:22:27.803874: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-05-17 17:22:27.804401: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2299995000 Hz
INFO:tensorflow:Collective batch_all_reduce: 1 all-reduces, num_devices = 1, group_size = 3, implementation = AUTO, num_packs = 1
INFO:tensorflow:Collective batch_all_reduce: 1 all-reduces, num_devices = 1, group_size = 3, implementation = AUTO, num_packs = 1
WARNING:tensorflow:/job:worker/replica:0/task:1 seems down, retrying 1/3
WARNING:tensorflow:/job:worker/replica:0/task:1 seems down, retrying 1/3
WARNING:tensorflow:/job:worker/replica:0/task:1 seems down, retrying 2/3
WARNING:tensorflow:/job:worker/replica:0/task:1 seems down, retrying 2/3
ERROR:tensorflow:Cluster check alive failed, /job:worker/replica:0/task:1 is down, aborting collectives: Deadline Exceeded
Additional GRPC error information from remote target /job:worker/replica:0/task:1:
:{"created":"@1652808205.878845297","description":"Deadline Exceeded","file":"external/com_github_grpc_grpc/src/core/ext/filters/deadline/deadline_filter.cc","file_line":69,"grpc_status":4}
ERROR:tensorflow:Cluster check alive failed, /job:worker/replica:0/task:1 is down, aborting collectives: Deadline Exceeded

Hi Dennis! Sorry, I just noticed that you also have a gcr.io/ string in the gs:// paths that should be removed:

Please compare the format to the one in the instructions:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: multi-worker
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: tensorflow
            image: gcr.io/qwiklabs-gcp-01-93af833e6576/mnist-train
            args:
            - --epochs=5
            - --steps_per_epoch=100
            - --per_worker_batch=64
            - --saved_model_path=gs://qwiklabs-gcp-01-93af833e6576-bucket/saved_model_dir
            - --checkpoint_path=gs://qwiklabs-gcp-01-93af833e6576-bucket/checkpoints

Hopefully we don’t miss anything this time and the lab works as expected. If not, please post the results again. Thank you!
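If it helps, after correcting the file the job can be resubmitted with something along these lines (assuming it was originally created with kubectl apply -f tfjob.yaml; deleting the failed TFJob first lets the operator recreate the worker pods cleanly):

kubectl delete tfjob multi-worker
kubectl apply -f tfjob.yaml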

Hi Chris,
Your analysis and correction addressed my issue. I have completed the Week 3 lab.

"When all else fails, read the directions" - advice I try to give myself, but my impatience usually wins, unfortunately.
Thank you for your persistence in staying with my issue!

Good job,
Dennis


Awesome! Glad it works now. Enjoy the rest of the course!