C3W3 Problem in Distributed Multi-worker TensorFlow Training on Kubernetes

I have tried to pass the Distributed Multi-worker TensorFlow Training on Kubernetes lab 3 times and always get the same error.

This is my tfjob.yaml:

By running the following command:

And the workers never start:

Am I doing something wrong or skipping a step?

Thanks

I have the same question.

Hi! Please check the saved_model and checkpoint arguments in your YAML file. The -bucket suffix should come after your project ID, not after mnist-train. Please compare it to the example shown in the instructions. Hope that works!
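One quick way to double-check this is to search the YAML for the expected bucket name. This is only a minimal sketch: it assumes the bucket is named after your project ID plus -bucket as described above, and the helper name check_bucket_args is made up for illustration.

```shell
# Hypothetical helper: reports whether a YAML file references "<project id>-bucket".
# Usage: check_bucket_args <project_id> <yaml_file>
check_bucket_args() {
  project_id="$1"
  yaml_file="$2"
  # -F treats the pattern as a fixed string; "--" ends option parsing.
  if grep -n -F -- "${project_id}-bucket" "$yaml_file"; then
    echo "OK: found ${project_id}-bucket"
  else
    echo "MISSING: no reference to ${project_id}-bucket in ${yaml_file}"
    return 1
  fi
}

# Example call (assumes PROJECT_ID is set and tfjob.yaml is in the current directory):
# check_bucket_args "$PROJECT_ID" tfjob.yaml
```

If the helper reports MISSING, the saved_model and checkpoint paths probably still point at a different bucket name.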

Even after doing that, it still doesn’t work for me. I keep getting this error:


I have the same error.


add more information

Hi Gerald! The screenshot you showed is just a warning so it’s possible that the job is started correctly. Please try running the next commands and see if you are able to train the model in the cluster.

Hi! From your screenshot, it seems that it cannot find the image (i.e. gcr.io/qwiklabs-gcp....). Do you see it when you run gcloud container images list? If not, please make sure that you’ve built and pushed it as mentioned in the instructions. That section has this command sequence:

IMAGE_NAME=mnist-train
docker build -t gcr.io/${PROJECT_ID}/${IMAGE_NAME} .
docker push gcr.io/${PROJECT_ID}/${IMAGE_NAME}
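To confirm that the image the job asks for is the one that was pushed, you can compare the reference from the commands above with what the YAML contains. A rough sketch, assuming the file is named tfjob.yaml and that PROJECT_ID and IMAGE_NAME are set as above; the helper names image_ref and check_image_in_yaml are hypothetical.

```shell
# Hypothetical helper: prints the image reference used by the build/push commands.
# Usage: image_ref <project_id> <image_name>
image_ref() {
  printf 'gcr.io/%s/%s\n' "$1" "$2"
}

# Hypothetical helper: checks that a YAML file mentions that exact reference.
# Usage: check_image_in_yaml <project_id> <image_name> <yaml_file>
check_image_in_yaml() {
  ref="$(image_ref "$1" "$2")"
  if grep -q -F -- "$ref" "$3"; then
    echo "OK: $3 references $ref"
  else
    echo "MISMATCH: $3 does not reference $ref"
    return 1
  fi
}

# Example (not run here; needs gcloud and a configured project):
# gcloud container images list --repository="gcr.io/${PROJECT_ID}"
# check_image_in_yaml "$PROJECT_ID" "$IMAGE_NAME" tfjob.yaml
```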

If you’re still running into the same error, please post here the output of this command after applying the tfjob.yaml (as also written in the instructions).

JOB_NAME=multi-worker
kubectl describe tfjob $JOB_NAME
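Alongside the describe output, it can help to see which pods are not running. The filter below is just a sketch: it assumes the default kubectl get pods column layout (NAME, READY, STATUS, RESTARTS, AGE), and the function name not_running is made up.

```shell
# Hypothetical filter: reads `kubectl get pods` output on stdin and prints the
# name and status of every pod whose STATUS (3rd column) is neither
# Running nor Completed. The header row is skipped.
not_running() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Example (not run here; needs access to the cluster):
# kubectl get pods | not_running
```

A status such as ImagePullBackOff or ErrImagePull usually means the cluster could not pull the image named in the YAML.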

Thanks!

Hi,

I am having the same issue - Here’s the output:



I believe the container was pushed, since it appears in the list, though training has not completed.
Edit: I found that I needed to update the image argument in the YAML file. After repeating the lab, I was able to pass.

@chris.favila Hi Chris, I’m having the same issue right now. I think I have updated the three arguments correctly, as you can see in my screenshot:


For the image, I passed the image name with my project ID.
For saved_model and checkpoint, I used the project name plus -bucket.
I really couldn’t figure out where the problem is, but it seems like the training never started.
Afterwards, I tried to run the follow-up command, but I got this error:

So, what should I do to restart the workers? Thanks.

OK, I think I figured out how to solve this problem. You just need to run this command
kubectl delete tfjob $JOB_NAME
to remove the job and its associated pods, then recreate the workers and run the training again.

I don’t know why the previous one was frozen, but after I recreated multi-worker and submitted the job, it worked and finished successfully.
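For anyone else who hits this, the delete-and-resubmit sequence could look roughly like the sketch below. It assumes the job is submitted with kubectl apply -f tfjob.yaml (check the lab instructions for the exact command); the run and restart_tfjob helpers and the DRY_RUN switch are made up for illustration.

```shell
# Hypothetical wrapper: with DRY_RUN=1 it prints the command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: deletes the TFJob, re-applies the manifest, lists pods.
# Usage: restart_tfjob <job_name> <manifest>
restart_tfjob() {
  job_name="$1"
  manifest="$2"
  # --ignore-not-found avoids an error if the job was already removed.
  run kubectl delete tfjob "$job_name" --ignore-not-found
  # Assumption: the lab submits the job with `kubectl apply -f`.
  run kubectl apply -f "$manifest"
  run kubectl get pods
}

# Example: preview the commands without touching the cluster
# DRY_RUN=1; restart_tfjob multi-worker tfjob.yaml
```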

@chris.favila You don’t have to reply to my post anymore, thank you anyway.

Hi Josh! Welcome to the community! I was just typing my reply and saw that you resolved it already. Glad it’s now working!


Hi Chris, I’m getting an error when running the command. It says

invalid argument “Google Cloud console” for “-t, --tag” flag: invalid reference format
See ‘docker build --help’.
invalid reference format.

Could you advise me on how to proceed?

Hi Amit! Kindly create a new topic along with the exact command you are running. You can also post a screenshot of the output to make it clearer to the mentors. I’ll also try to take a look. Thanks!
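In the meantime, note that the error quoted above shows docker receiving a -t value that contains spaces, which is not a valid image reference. Printing the tag before building can show whether a variable expanded to something unexpected. This is only a simplified sketch (it checks for empty values and whitespace, not the full reference grammar), and the helper name check_tag is hypothetical.

```shell
# Hypothetical helper: rejects an image tag that is empty or contains whitespace.
# This is a simplified check, not a full validation of Docker's reference format.
# Usage: check_tag <tag>
check_tag() {
  case "$1" in
    "")
      echo "INVALID: tag is empty"
      return 1
      ;;
    *[[:space:]]*)
      echo "INVALID: tag contains whitespace: $1"
      return 1
      ;;
    *)
      echo "OK: $1"
      ;;
  esac
}

# Example (assumes PROJECT_ID and IMAGE_NAME are set as in the build commands):
# check_tag "gcr.io/${PROJECT_ID}/${IMAGE_NAME}"
```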