Hi! Please check the saved_model and checkpoint values in your YAML file. The -bucket suffix should come right after the project ID, not after mnist-train. Please compare it to the example shown in the instructions. Hope that works!
Hi Gerald! The screenshot you showed is just a warning, so it’s possible that the job started correctly. Please try running the next commands and see if you are able to train the model in the cluster.
Hi! From your screenshot, it seems that it cannot find the image (i.e. gcr.io/qwiklabs-gcp....). Do you see it when you run gcloud container images list? If not, please make sure that you’ve built and pushed it as mentioned in the instructions. That section has this command sequence:
I believe the container was pushed, since it shows up in the list, but the training does not complete.
Edit: I found that I needed to update the image argument in the YAML file. After repeating the lab, I was able to pass.
For the image, I passed the image name with my project ID.
For saved_model and checkpoint, I used the project ID plus -bucket.
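In case it helps anyone else, here is a rough sketch of how those values fit together. The project ID below is a placeholder, mnist-train is assumed to be the image name (it is only mentioned in passing above), and the gs:// prefix and the path suffixes are my assumptions, so compare against the example in the lab instructions before copying anything.

```shell
# Placeholder project id; use the one shown in your lab console.
PROJECT_ID="${PROJECT_ID:-my-project-id}"

# Value for the image argument: registry host, project id, image name.
# "mnist-train" is an assumed image name; use the name you actually pushed.
IMAGE="gcr.io/${PROJECT_ID}/mnist-train"

# Bucket for saved_model and checkpoint: the project id followed by -bucket.
BUCKET="gs://${PROJECT_ID}-bucket"

echo "image:       ${IMAGE}"
echo "saved_model: ${BUCKET}/saved_model"  # path suffix is an assumption
echo "checkpoint:  ${BUCKET}/checkpoint"   # path suffix is an assumption
```

The point is only that the project ID appears in both places, and that -bucket is attached to the project ID rather than to the image name.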
I really couldn’t figure out where the problem is, but it seems like the training never started.
After that, I tried to run the follow-up command, but I got this error:
OK, I think I figured out how to solve this problem. You just need to use this command:
kubectl delete tfjob $JOB_NAME
to remove the job and its associated pods, then recreate the workers and run the training again.
I don’t know why the previous one was frozen, but after I recreated the multi-worker job and submitted it again, it worked and finished successfully.
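For anyone who hits the same freeze, the reset can be sketched as a small helper. This is only an outline of the steps described above, not the lab's official procedure: the function names are made up, the manifest file name in the usage line is hypothetical, and using kubectl apply to resubmit is an assumption (the lab may use a different command to create the job).

```shell
# Print each command; actually run it unless DRY_RUN is set to 1.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Delete a stuck TFJob, resubmit it from its manifest, then list the pods.
# $1: job name, $2: path to the TFJob YAML file.
reset_tfjob() {
  job_name="$1"
  manifest="$2"
  run kubectl delete tfjob "$job_name" || return 1
  run kubectl apply -f "$manifest" || return 1
  run kubectl get pods
}
```

Usage would look like `reset_tfjob "$JOB_NAME" tfjob.yaml`, where tfjob.yaml stands in for whatever your YAML file is called. Setting DRY_RUN=1 first prints the commands without touching the cluster.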
@chris.favila You don’t have to reply to my post anymore, thank you anyway.