C3W3 Lab 2 Distributed Multi-worker TensorFlow Training on Kubernetes Help

Hi everyone,

When I submit the TFJob (with the correct tfjob.yaml manifest), all 3 workers seem to fail almost immediately.

Running kubectl describe tfjob $JOB_NAME shows the following message under STATUS > CONDITIONS > MESSAGE: “TFJob multi-worker has failed because 1 Worker replica(s) failed.”

Running “kubectl get pods” confirms that all 3 workers have ERROR as STATUS.

What should I do to solve the issue?

Thank you very much for your help!

Hubert

Below is the event log:

Events:
  Type    Reason                   Age    From         Message
  ----    ------                   ----   ----         -------
  Normal  SuccessfulCreatePod      3m34s  tf-operator  Created pod: multi-worker-worker-0
  Normal  SuccessfulCreatePod      3m34s  tf-operator  Created pod: multi-worker-worker-1
  Normal  SuccessfulCreatePod      3m34s  tf-operator  Created pod: multi-worker-worker-2
  Normal  SuccessfulCreateService  3m34s  tf-operator  Created service: multi-worker-worker-0
  Normal  SuccessfulCreateService  3m34s  tf-operator  Created service: multi-worker-worker-1
  Normal  SuccessfulCreateService  3m34s  tf-operator  Created service: multi-worker-worker-2
  Normal  ExitedWithCode           2m52s  tf-operator  Pod: default.multi-worker-worker-1 exited with code 1
  Normal  TFJobFailed              2m52s  tf-operator  TFJob multi-worker has failed because 1 Worker replica(s) failed.

The problem was actually solved quite easily.

I forgot to mention in my first message that the updated job manifest (tfjob.yaml) contained an error, which probably caused the first job submission to fail. After correcting the error I tried to resubmit, without success. Deleting the pods (using the command from the lab) and then resubmitting worked just fine. [I deleted this post by mistake]
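
For anyone who lands here with the same symptom, the delete-then-resubmit sequence can be sketched roughly as follows. The lab's exact cleanup command is not quoted in this thread, so this assumes that deleting the failed TFJob itself is what clears the old worker pods. The function name resubmit_tfjob and its two arguments are hypothetical helpers for illustration, not something from the lab.

```shell
# Sketch only. Assumptions: the first argument is the TFJob name
# (e.g. the value of $JOB_NAME) and the second is the path to the
# corrected manifest (e.g. tfjob.yaml).
resubmit_tfjob() {
  job_name="$1"
  manifest="$2"

  # Delete the failed TFJob; the worker pods it owns are normally
  # cleaned up along with it. --ignore-not-found avoids an error
  # if the job has already been removed.
  kubectl delete tfjob "$job_name" --ignore-not-found || return 1

  # Submit the corrected manifest as a fresh job.
  kubectl apply -f "$manifest" || return 1

  # List the pods so you can check that the new workers start.
  kubectl get pods
}
```

You would call it as resubmit_tfjob "$JOB_NAME" tfjob.yaml. If the workers still fail, running kubectl logs on one of the failed pods (for example multi-worker-worker-1 from the event log above) should show the actual error behind the exit code 1.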

To better help those who come after you, please share the full change you made instead of just writing “I made an error and fixed it”.

Hello,

Can you explain how you fixed that error? I am not able to find where the mistake is.

Regards
