Distributed Multi-worker TensorFlow Training on Kubernetes

Hello team, I need some help.
When I submit "strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()", I receive the error message "-bash: syntax error near unexpected token `('".

see upload

Please advise.

On top of that, the instructions say: "Review the files." How am I supposed to do this?

I appreciate your help. Thanks in advance.

Hi @jugrimmer, it looks like you are trying to run Python code in a Bash shell, which cannot parse the parentheses as written. The line in the instructions is meant to be part of main.py; it is not a command for submitting the job.
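For illustration, here is a minimal sketch of how that line could sit inside a Python file such as main.py. Only the strategy line comes from the lab instructions; the function name build_strategy is my own placeholder, and the lab's actual main.py contains more than this.

```python
# Sketch of where the strategy line lives: inside a Python file,
# not at the Bash prompt. build_strategy is a hypothetical helper name.


def build_strategy():
    """Create the multi-worker strategy. Requires TensorFlow to be installed."""
    # Imported inside the function so this file can be read or imported
    # on a machine without TensorFlow (an assumption made for this sketch).
    import tensorflow as tf

    # The exact line quoted from the lab instructions.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    return strategy
```

Because the Python interpreter reads this file, the parentheses are parsed as a normal function call instead of triggering the Bash syntax error.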

Hope that helps,
Cuong

Hi @tranvinhcuong ,

I’m unable to complete the lab ‘Distributed Multi-worker TensorFlow Training on Kubernetes’ due to a persistent error. At the ‘Submit TFJob’ stage, the pods go into an error state.

I’ve ensured that the image, saved_model_path and checkpoint are correctly updated in tfjob.yaml. I can’t figure out why this error is occurring.

Hi Mrinal! Welcome to Discourse! Looking at your YAML file, it seems you forgot to append a -bucket to your saved model and checkpoint paths. That’s why you’re getting the error. See the example YAML file in the instructions to see what I mean. Hope this helps!
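If it helps, a quick check like the one below can catch a missing suffix before you apply the YAML. This is only a sketch: it assumes the paths are gs:// URIs whose bucket name should end in -bucket, and the function name and example paths are made up for illustration, not taken from the lab.

```python
from urllib.parse import urlparse


def bucket_has_suffix(path, suffix="-bucket"):
    """Return True if path is a gs:// URI whose bucket name ends with suffix."""
    parsed = urlparse(path)
    # For gs://my-bucket/some/dir, netloc is the bucket name "my-bucket".
    return parsed.scheme == "gs" and parsed.netloc.endswith(suffix)


# Hypothetical paths, not the real lab values.
print(bucket_has_suffix("gs://example-project-bucket/saved_model_dir"))
print(bucket_has_suffix("gs://example-project/saved_model_dir"))
```

The first call reports a path with the suffix, the second one a path where it was forgotten.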

I have now done this lab 4 times and keep receiving an error at the same stage, involving the ‘kubectl get pods’ command.
Initially, my output showed that only the chief worker was running while the other 2 workers showed ‘ERROR’. Now, after re-running, the chief worker’s logs don’t show and my 2 other workers’ logs still indicate ‘ERROR’. Help!

Here is a snapshot of my tfjob.yaml file and the current output of ‘kubectl get pods’ .

Hi Wangari! Unfortunately, I cannot replicate the error. I redid the lab and got past all checkpoints. If you got all checkpoints right before this one, then the error just might be in this section (Preparing TF Job). Can you also post the output of this command? Please wait one minute after applying the tfjob.yaml. We might see a useful message there.

JOB_NAME=multi-worker
kubectl describe tfjob $JOB_NAME
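If you would rather script the wait, the same step could be sketched in Python as below. The helper names and the 60-second default are my own choices for illustration; the only command it runs is the kubectl describe shown above, and it assumes kubectl is already configured for the lab cluster.

```python
import subprocess
import time

JOB_NAME = "multi-worker"  # job name from the commands above


def describe_command(job_name):
    """Build the `kubectl describe tfjob` command as an argument list."""
    return ["kubectl", "describe", "tfjob", job_name]


def describe_after_wait(job_name, wait_seconds=60):
    """Wait, then run the describe command and return its combined output."""
    time.sleep(wait_seconds)
    result = subprocess.run(
        describe_command(job_name),
        capture_output=True,
        text=True,
        check=False,  # keep the output even if kubectl reports an error
    )
    return result.stdout + result.stderr
```

Calling describe_after_wait(JOB_NAME) right after applying tfjob.yaml gives the job a minute to update before its description is collected.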

You can also post a screenshot of what you see when you visit the URL output of gcloud container images list (e.g. gcr.io/qwiklabs-gcp-03-fbe259d78381/mnist-train). It should look something like below. Just make sure to use the Qwiklabs login credentials. If you open this in a new tab, you might inadvertently switch to your own Gmail account.

Hope we’ll see the issue in these steps. By the way, just in case you exceed your quota, you can ask for an extension via the Qwiklabs support chat.

Let us know how it goes! Thanks!

Hi, it finally worked. I think the problem was that I was not giving the tfjob.yaml job time to be updated. The other possibility is that I made typos when typing my bucket name. All this to say it finally worked. Thanks!