C3W3 Problem while Submitting TF Job

I'm facing an issue when running kubectl logs --follow ${JOB_NAME}-worker-0:

Error from server (BadRequest): container "tensorflow" in pod "multi-worker-worker-0" is waiting to start: trying and failing to pull image

I'm not able to debug this.
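For anyone else stuck at this point, here is a rough sketch of how the pull failure could be inspected. It uses only standard kubectl commands; the pod name is taken from the error message above, and the DRY_RUN switch and the run helper are my own additions for illustration, not something from the lab.

```shell
#!/usr/bin/env bash
# Sketch: inspect why a pod is stuck pulling its image.
# Assumption: POD_NAME matches the pod named in the error message.
POD_NAME="${POD_NAME:-multi-worker-worker-0}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

diagnose_pull_failure() {
  # The Events section at the end of the output usually shows the exact pull error.
  run kubectl describe pod "$POD_NAME"
  # List the recent events recorded for this pod only.
  run kubectl get events --field-selector "involvedObject.name=$POD_NAME"
}

diagnose_pull_failure
```

By default this only prints the commands. Run it with DRY_RUN=0 to execute them against your cluster.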

I was able to update the yml file as suggested in other threads:

name: tensorflow
image: gcr.io/qwiklabs-gcp-01-8c7f6bd2d495/mnist-train
args:
  - --epochs=5
  - --steps_per_epoch=100
  - --per_worker_batch=64
  - --saved_model_path=gs://qwiklabs-gcp-01-8c7f6bd2d495-bucket/saved_model_dir
  - --checkpoint_path=gs://qwiklabs-gcp-01-8c7f6bd2d495-bucket/checkpoints
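If the yml looks right and the pull still fails, it may be worth confirming that the image was actually pushed to the registry and that the project ID in the image path matches the active project. A minimal sketch, assuming gcloud is available in the lab shell; the IMAGE value is copied from the snippet above and will differ for each lab session, and the run helper and DRY_RUN switch are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch: check that the image referenced in the yml exists in the registry.
# Assumption: IMAGE is the value of the "image:" field in your own yml file.
IMAGE="${IMAGE:-gcr.io/qwiklabs-gcp-01-8c7f6bd2d495/mnist-train}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

check_image() {
  # Show the active project so it can be compared with the ID in the image path.
  run gcloud config get-value project
  # List the tags of the image; an error or empty list suggests it was never pushed.
  run gcloud container images list-tags "$IMAGE"
}

check_image
```

If the project ID printed by the first command does not match the one in the image path, the yml is probably still pointing at a project from an earlier lab session.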

I deleted my pod and restarted it, and it worked!

Hi Sowmiyan! Just saw your post in the other thread. Glad you were able to make it work! This seems like something we should report to Qwiklabs since it looks like you’ve done all previous steps correctly. Haven’t encountered this before. Thanks for sharing the solution!


I have done the same and restarted, but it still gives me the same error. That's 4 attempts already.

Have you checked if the yml file has been updated as mentioned in the lab?

I had seen that feedback in other threads! Even after applying it, the job was not working for me, but once I restarted the pod it worked fine.

Yes, I did. As you can see from the image, the image name, bucket, and checkpoint path are correct.

Yes, it seems the yml file is correct too :frowning:

I feel bad because I tried it multiple times. I already completed the course, but because of this assignment I was charged today for another round of subscription. The next course will start on 20 September. :frowning:

May I know the commands to delete and restart pods? I had the same problem. I inspected the yaml file many times and started from scratch a few times, but I'm still getting the same worker error.
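One possible way to do this, as a rough sketch rather than the official lab steps: delete the stuck worker pod and watch whether it comes back, and if it does not, delete and resubmit the whole job from its yml file. The job name multi-worker is inferred from the pod name in the error message; the file name tfjob.yaml, the DRY_RUN switch, and the run helper are assumptions for illustration, so substitute your own values.

```shell
#!/usr/bin/env bash
# Sketch: restart a stuck worker pod, or resubmit the whole job.
# Assumptions: JOB_NAME is inferred from the pod name in the error message;
# TFJOB_FILE is a hypothetical name for the yml file you edited.
JOB_NAME="${JOB_NAME:-multi-worker}"
TFJOB_FILE="${TFJOB_FILE:-tfjob.yaml}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

restart_worker_pod() {
  # Delete the stuck pod; whether it is recreated depends on the job controller.
  run kubectl delete pod "${JOB_NAME}-worker-0"
  # Check the pod list to see if a replacement pod appears.
  run kubectl get pods
}

resubmit_job() {
  # Remove the resources defined in the yml file, then submit them again.
  run kubectl delete -f "$TFJOB_FILE"
  run kubectl apply -f "$TFJOB_FILE"
  run kubectl get pods
}

restart_worker_pod
```

By default this only prints the commands. Run it with DRY_RUN=0 to execute them, and call resubmit_job instead if the worker pod does not come back on its own.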