W3 Graded Lab: Distributed Multi-worker TensorFlow Training on Kubernetes
The pods never reach the Running status; all three stay in ImagePullBackOff.
(70/100)
kubectl get pods
Notice that the pods are named using the following convention: [JOB_NAME]-worker-[WORKER_INDEX].
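To make the naming convention concrete, here is a minimal sketch that prints the pod names expected for a given job. The helper name `pod_names` is hypothetical, and the job name `multi-worker` with three workers is inferred from the pod listing in this post, not from the lab files.

```shell
# pod_names JOB_NAME NUM_WORKERS
# Prints one pod name per line, following [JOB_NAME]-worker-[WORKER_INDEX].
# Hypothetical helper for illustration; it does not talk to the cluster.
pod_names() {
  job_name=$1
  num_workers=$2
  i=0
  while [ "$i" -lt "$num_workers" ]; do
    printf '%s-worker-%d\n' "$job_name" "$i"
    i=$((i + 1))
  done
}

# Job name and worker count assumed from the listing in this post.
pod_names multi-worker 3
```

Index 0 is the chief, so the first name printed is the pod whose logs the lab asks for.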
Wait till the status of all pods changes to Running.
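Rather than re-running the command by hand, the check could be scripted. The sketch below assumes the default `kubectl get pods` column layout (a header row, STATUS in the third column); `all_running` is a hypothetical helper, and the sample listing is copied from this post so the block runs without a cluster.

```shell
# all_running: reads `kubectl get pods` output on stdin.
# Exits 0 only if at least one pod is listed and every STATUS is Running.
# Assumes the default layout: header on line 1, STATUS in column 3.
all_running() {
  awk 'NR > 1 { seen = 1; if ($3 != "Running") bad = 1 }
       END { exit !(seen && !bad) }'
}

# Sample listing taken from this post (no cluster needed to run this).
listing='NAME READY STATUS RESTARTS AGE
multi-worker-worker-0 0/1 ImagePullBackOff 0 5m5s
multi-worker-worker-1 0/1 ImagePullBackOff 0 5m5s
multi-worker-worker-2 0/1 ImagePullBackOff 0 5m5s'

if printf '%s\n' "$listing" | all_running; then
  echo "all pods Running"
else
  echo "not all pods Running yet"
fi

# On a live cluster the same check could be polled, for example:
#   until kubectl get pods | all_running; do sleep 10; done
# Note that this loop never ends if the pods are stuck, as they are here.
```

kubectl also has a built-in alternative, `kubectl wait --for=condition=Ready pod --all --timeout=300s`, which gives up after the timeout instead of polling forever.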
To retrieve the logs for the chief (worker 0), execute the following command. It will continue streaming the logs until the training program completes.
student_04_594954e1dbf0@cloudshell:~/lab-files (qwiklabs-gcp-03-5790d7d6f5d1)$ kubectl get pods
NAME READY STATUS RESTARTS AGE
multi-worker-worker-0 0/1 ImagePullBackOff 0 5m5s
multi-worker-worker-1 0/1 ImagePullBackOff 0 5m5s
multi-worker-worker-2 0/1 ImagePullBackOff 0 5m5s
student_04_594954e1dbf0@cloudshell:~/lab-files (qwiklabs-gcp-03-5790d7d6f5d1)$ kubectl get pods
NAME READY STATUS RESTARTS AGE
multi-worker-worker-0 0/1 ImagePullBackOff 0 7m26s
multi-worker-worker-1 0/1 ImagePullBackOff 0 7m26s
multi-worker-worker-2 0/1 ImagePullBackOff 0 7m26s
student_04_594954e1dbf0@cloudshell:~/lab-files (qwiklabs-gcp-03-5790d7d6f5d1)$ kubectl get pods
NAME READY STATUS RESTARTS AGE
multi-worker-worker-0 0/1 ImagePullBackOff 0 11m
multi-worker-worker-1 0/1 ImagePullBackOff 0 11m
multi-worker-worker-2 0/1 ImagePullBackOff 0 11m
student_04_594954e1dbf0@cloudshell:~/lab-files (qwiklabs-gcp-03-5790d7d6f5d1)$ kubectl get pods
NAME READY STATUS RESTARTS AGE
multi-worker-worker-0 0/1 ImagePullBackOff 0 17m
multi-worker-worker-1 0/1 ImagePullBackOff 0 17m
multi-worker-worker-2 0/1 ImagePullBackOff 0 17m
student_04_594954e1dbf0@cloudshell:~/lab-files (qwiklabs-gcp-03-5790d7d6f5d1)$ kubectl get pods
NAME READY STATUS RESTARTS AGE
multi-worker-worker-0 0/1 ImagePullBackOff 0 37m
multi-worker-worker-1 0/1 ImagePullBackOff 0 37m
multi-worker-worker-2 0/1 ImagePullBackOff 0 37m
student_04_594954e1dbf0@cloudshell:~/lab-files (qwiklabs-gcp-03-5790d7d6f5d1)$ kubectl logs --follow ${JOB_NAME}-worker-0
Error from server (BadRequest): container "tensorflow" in pod "multi-worker-worker-0" is waiting to start: trying and failing to pull image
student_04_594954e1dbf0@cloudshell:~/lab-files (qwiklabs-gcp-03-5790d7d6f5d1)$ kubectl logs --follow ${JOB_NAME}-worker-0
Error from server (BadRequest): container "tensorflow" in pod "multi-worker-worker-0" is waiting to start: trying and failing to pull image
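ImagePullBackOff means the node keeps failing to pull the container image and is backing off between retries. Typical causes are a wrong image name or tag in the job manifest, an image that was never pushed to the registry, or missing pull permissions; the output above does not say which applies. As a starting point, this sketch picks the affected pods out of a listing so they can be inspected. `image_pull_failures` is a hypothetical helper, and the kubectl commands in the comments are suggestions that were not run for this post.

```shell
# image_pull_failures: reads `kubectl get pods` output on stdin and prints
# the names of pods whose STATUS is ImagePullBackOff or ErrImagePull.
# Assumes the default layout: header on line 1, STATUS in column 3.
image_pull_failures() {
  awk 'NR > 1 && ($3 == "ImagePullBackOff" || $3 == "ErrImagePull") { print $1 }'
}

# Sample listing taken from this post (no cluster needed to run this).
listing='NAME READY STATUS RESTARTS AGE
multi-worker-worker-0 0/1 ImagePullBackOff 0 37m
multi-worker-worker-1 0/1 ImagePullBackOff 0 37m
multi-worker-worker-2 0/1 ImagePullBackOff 0 37m'

printf '%s\n' "$listing" | image_pull_failures

# On a live cluster (not run here), the Events section of `kubectl describe`
# usually names the image and the reason the pull failed:
#   kubectl get pods | image_pull_failures | xargs -n 1 kubectl describe pod
# The image a pod is trying to pull can be printed with:
#   kubectl get pod multi-worker-worker-0 -o jsonpath='{.spec.containers[*].image}'
```

Comparing the image name printed by the last command with the image that was actually built and pushed during the lab is a quick way to spot a mismatch.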