C3W3 lab syntax error

When I paste this into the lab:
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
task_type = strategy.cluster_resolver.task_type
task_id = strategy.cluster_resolver.task_id
global_batch_size = per_worker_batch * strategy.num_replicas_in_sync

I get this error:
student_03_ea73524ff83b@cloudshell:~/lab-files (qwiklabs-gcp-03-48cca648bd05)$ strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
-bash: syntax error near unexpected token `('

I tried entering it line by line, but that did not work either.

Hi @KET1
Are you running that TensorFlow statement (strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()) from the bash shell?
I cannot tell from your output.
BR

Yes, I am running it from the Google Cloud Shell exactly as it is stated in the lab.

OK, I need to review the lab.
Br

Hi @KET1
I see the problem now.
Reading from the lab:

" The training module is in the mnist folder. The model.py file contains a function to create a simple convolutional network. The main.py file contains data preprocessing routines and a distributed training loop. Review the files. Notice how you can use a tf.distribute.experimental.MultiWorkerMirrorStrategy() object to retrieve information about the topology of the distributed cluster running a job."

So the statements you mention are already written in the main.py file. You don't need to add anything or execute any Python instruction yourself; you cannot execute Python statements from a bash shell. The main.py will be run automatically during the training process.
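If you do want to experiment with those statements interactively, you first have to start a Python interpreter; a minimal sketch, assuming an environment with TensorFlow 2.x installed (plain Cloud Shell may not have it):

python3
>>> import tensorflow as tf
>>> # Outside the TFJob there is no TF_CONFIG variable, so the strategy
>>> # resolves to a single local worker rather than the 3-worker cluster.
>>> strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
>>> strategy.num_replicas_in_sync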
Hope this helps
BR

Thanks for the feedback. I edited the tfjob to:

image: gcr.io/qwiklabs-gcp-01-d09a99f53d35/mnist-train
args:
  - --epochs=5
  - --steps_per_epoch=100
  - --per_worker_batch=64
  - --saved_model_path=gs://qwiklabs-gcp-01-d09a99f53d35-bucket/saved_model_dir
  - --checkpoint_path=gs://qwiklabs-gcp-01-d09a99f53d35-bucket/checkpoints

but when I run: kubectl logs --follow ${JOB_NAME}-worker-0

I get:

Error from server (BadRequest): container "tensorflow" in pod "multi-worker-worker-0" is waiting to start: trying and failing to pull image

Do I need to wait or have I done something wrong?

For gcloud container images list I get:

gcr.io/qwiklabs-gcp-00-da05ff11cc06/mnist-train

All seems good, unless something is wrong with my paths.

Hi @KET1
The tfjob seems to be correct.
It sounds as if you have a dirty environment. In my experience, an error in the tfjob file can cause the following statements to fail.
Could you share the whole tfjob file?
Are you sure you executed all the statements after the tfjob was modified?
If yes, please restart the lab from the beginning.
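Before restarting, it may also be worth checking why the pull fails; the pod events usually show the exact registry error. For example (pod name taken from your kubectl output):

kubectl describe pod multi-worker-worker-0
gcloud container images list

If the image path in the tfjob does not exactly match the path printed by gcloud container images list (including the project id), the pull will fail.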
Br

I restarted and got the same result. I am getting 70/100 at this point, so everything up to here should be fine.

student_03_ea73524ff83b@cloudshell:~/lab-files (qwiklabs-gcp-01-75ca4c30e1b2)$ kubectl get pods
NAME                    READY   STATUS             RESTARTS   AGE
multi-worker-worker-0   0/1     ImagePullBackOff   0          23s
multi-worker-worker-1   0/1     ImagePullBackOff   0          23s
multi-worker-worker-2   0/1     ImagePullBackOff   0          23s
student_03_ea73524ff83b@cloudshell:~/lab-files (qwiklabs-gcp-01-75ca4c30e1b2)$ kubectl logs --follow ${JOB_NAME}-worker-0

Here is tfjob.yaml:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata: # kpt-merge: /multi-worker
  name: multi-worker
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/qwiklabs-gcp-01-d09a99f53d35/mnist-train
              args:
                - --epochs=5
                - --steps_per_epoch=100
            - name: tensorflow
              image: gcr.io/qwiklabs-gcp-01-d09a99f53d35/mnist-train
              args:
                - --epochs=5
                - --steps_per_epoch=100
                - --per_worker_batch=64
                - --saved_model_path=gs://qwiklabs-gcp-01-d09a99f53d35-bucket/saved_model_dir
                - --checkpoint_path=gs://qwiklabs-gcp-01-d09a99f53d35-bucket/checkpoints

Hi @KET1
Just for comparison, here is my tfjob.yaml file:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata: # kpt-merge: /multi-worker
  name: multi-worker
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/qwiklabs-gcp-02-fe736ea69f19/mnist-train
              args:
                - --epochs=5
                - --steps_per_epoch=100
                - --per_worker_batch=64
                - --saved_model_path=gs://qwiklabs-gcp-02-fe736ea69f19-bucket/saved_model_dir
                - --checkpoint_path=gs://qwiklabs-gcp-02-fe736ea69f19-bucket/checkpoints

Maybe I'm wrong, but in your tfjob file I see the 'args' section twice.

I have run the command

kubectl describe tfjob $JOB_NAME
Name:         multi-worker
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         TFJob
Metadata:
  Creation Timestamp:  2021-09-24T04:13:39Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:cleanPodPolicy:
        f:tfReplicaSpecs:
          .:
          f:Worker:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2021-09-24T04:13:39Z
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:successPolicy:
        f:tfReplicaSpecs:
          f:Worker:
            f:restartPolicy:
            f:template:
              f:metadata:
                .:
                f:creationTimestamp:
              f:spec:
                f:containers:
      f:status:
        .:
        f:completionTime:
        f:conditions:
        f:replicaStatuses:
          .:
          f:Worker:
            .:
            f:succeeded:
        f:startTime:
    Manager:         tf-operator.v1
    Operation:       Update
    Time:            2021-09-24T04:15:57Z
  Resource Version:  5491
  UID:               66639814-ff27-4a67-9c23-e90e8fc9c265
Spec:
  Clean Pod Policy:  None
  Tf Replica Specs:
    Worker:
      Replicas:  3
      Template:
        Spec:
          Containers:
            Args:
              --epochs=5
              --steps_per_epoch=100
              --per_worker_batch=64
              --saved_model_path=gs://qwiklabs-gcp-02-fe736ea69f19-bucket/saved_model_dir
              --checkpoint_path=gs://qwiklabs-gcp-02-fe736ea69f19-bucket/checkpoints
            Image:  gcr.io/qwiklabs-gcp-02-fe736ea69f19/mnist-train
            Name:   tensorflow
Status:
  Completion Time:  2021-09-24T04:15:57Z
  Conditions:
    Last Transition Time:  2021-09-24T04:13:39Z
    Last Update Time:      2021-09-24T04:13:39Z
    Message:               TFJob multi-worker is created.
    Reason:                TFJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2021-09-24T04:14:17Z
    Last Update Time:      2021-09-24T04:14:17Z
    Message:               TFJob multi-worker is running.
    Reason:                TFJobRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2021-09-24T04:15:57Z
    Last Update Time:      2021-09-24T04:15:57Z
    Message:               TFJob multi-worker successfully completed.
    Reason:                TFJobSucceeded
    Status:                True
    Type:                  Succeeded
  Replica Statuses:
    Worker:
      Succeeded:  3
  Start Time:     2021-09-24T04:13:39Z
Events:
  Type    Reason                   Age                From         Message
  ----    ------                   ----               ----         -------
  Normal  SuccessfulCreatePod      2m49s              tf-operator  Created pod: multi-worker-worker-0
  Normal  SuccessfulCreatePod      2m49s              tf-operator  Created pod: multi-worker-worker-1
  Normal  SuccessfulCreatePod      2m49s              tf-operator  Created pod: multi-worker-worker-2
  Normal  SuccessfulCreateService  2m49s              tf-operator  Created service: multi-worker-worker-0
  Normal  SuccessfulCreateService  2m49s              tf-operator  Created service: multi-worker-worker-1
  Normal  SuccessfulCreateService  2m49s              tf-operator  Created service: multi-worker-worker-2
  Normal  ExitedWithCode           31s (x5 over 35s)  tf-operator  Pod: default.multi-worker-worker-2 exited with code 0
  Normal  ExitedWithCode           31s (x3 over 34s)  tf-operator  Pod: default.multi-worker-worker-1 exited with code 0
  Normal  ExitedWithCode           31s                tf-operator  Pod: default.multi-worker-worker-0 exited with code 0
  Normal  TFJobSucceeded           31s                tf-operator  TFJob multi-worker successfully completed.

Do you have a similar result? Maybe you can see some errors in your output.
Please take a look at the output of the command:

kubectl logs --follow ${JOB_NAME}-worker-0

BR

Only one args section. There seems to be no error until it tries to pull the image, which I have listed at the bottom of the paste. It says the container is waiting to start!?

tfReplicaSpecs:
  Worker:
    replicas: 3
    template:
      spec:
        containers:
          - name: tensorflow
            image: gcr.io/qwiklabs-gcp-01-faa3daf3cd19/mnist-train
            args:
              - --epochs=5
              - --steps_per_epoch=100
              - --per_worker_batch=64
              - --saved_model_path=gs://qwiklabs-gcp-01-faa3daf3cd19-bucket/saved_model_dir
              - --checkpoint_path=gs://qwiklabs-gcp-01-faa3daf3cd19-bucket/checkpoints

Here is the status:

student_03_ea73524ff83b@cloudshell:~/lab-files (qwiklabs-gcp-01-faa3daf3cd19)$ kubectl describe tfjob $JOB_NAME
Name:         multi-worker
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         TFJob
Metadata:
  Creation Timestamp:  2021-09-24T16:45:06Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:cleanPodPolicy:
        f:tfReplicaSpecs:
          .:
          f:Worker:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2021-09-24T16:45:06Z
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:successPolicy:
        f:tfReplicaSpecs:
          f:Worker:
            f:restartPolicy:
            f:template:
              f:metadata:
                .:
                f:creationTimestamp:
              f:spec:
                f:containers:
      f:status:
        .:
        f:conditions:
        f:replicaStatuses:
          .:
          f:Worker:
        f:startTime:
    Manager:         tf-operator.v1
    Operation:       Update
    Time:            2021-09-24T16:45:06Z
  Resource Version:  4405
  UID:               964a2df9-240f-4058-8eb1-93a460367fc7
Spec:
  Clean Pod Policy:  None
  Tf Replica Specs:
    Worker:
      Replicas:  3
      Template:
        Spec:
          Containers:
            Args:
              --epochs=5
              --steps_per_epoch=100
              --per_worker_batch=64
              --saved_model_path=gs://qwiklabs-gcp-01-faa3daf3cd19-bucket/saved_model_dir
              --checkpoint_path=gs://qwiklabs-gcp-01-faa3daf3cd19-bucket/checkpoints
            Image:  gcr.io/qwiklabs-gcp-01-faa3daf3cd19/mnist-train
            Name:   tensorflow
Status:
  Conditions:
    Last Transition Time:  2021-09-24T16:45:06Z
    Last Update Time:      2021-09-24T16:45:06Z
    Message:               TFJob multi-worker is created.
    Reason:                TFJobCreated
    Status:                True
    Type:                  Created
  Replica Statuses:
    Worker:
  Start Time:  2021-09-24T16:45:06Z
Events:
  Type    Reason                   Age  From         Message
  ----    ------                   ---  ----         -------
  Normal  SuccessfulCreatePod      25s  tf-operator  Created pod: multi-worker-worker-0
  Normal  SuccessfulCreatePod      25s  tf-operator  Created pod: multi-worker-worker-1
  Normal  SuccessfulCreatePod      25s  tf-operator  Created pod: multi-worker-worker-2
  Normal  SuccessfulCreateService  25s  tf-operator  Created service: multi-worker-worker-0
  Normal  SuccessfulCreateService  25s  tf-operator  Created service: multi-worker-worker-1
  Normal  SuccessfulCreateService  25s  tf-operator  Created service: multi-worker-worker-2

student_03_ea73524ff83b@cloudshell:~/lab-files (qwiklabs-gcp-01-faa3daf3cd19)$ kubectl logs --follow ${JOB_NAME}-worker-0
Error from server (BadRequest): container "tensorflow" in pod "multi-worker-worker-0" is waiting to start: trying and failing to pull image
student_03_ea73524ff83b@cloudshell:~/lab-files (qwiklabs-gcp-01-faa3daf3cd19)$ kubectl logs ${JOB_NAME}-worker-1
Error from server (BadRequest): container "tensorflow" in pod "multi-worker-worker-1" is waiting to start: image can't be pulled
student_03_ea73524ff83b@cloudshell:~/lab-files (qwiklabs-gcp-01-faa3daf3cd19)$

Hi @KET1
Please verify the output of the docker statements:

IMAGE_NAME=mnist-train
docker build -t gcr.io/${PROJECT_ID}/${IMAGE_NAME} .
docker push gcr.io/${PROJECT_ID}/${IMAGE_NAME}

Maybe the image has not been properly built.
I had no problem, and the build/push commands executed successfully.
To show the images that were built, run the following command from the shell:

student_04_7c511cf9d6e2@cloudshell:~/lab-files (qwiklabs-gcp-04-5a564b22b256)$ docker  image ls
REPOSITORY                                        TAG       IMAGE ID       CREATED         SIZE
gcr.io/qwiklabs-gcp-04-5a564b22b256/mnist-train   latest    bcb6bcc00c7a   2 minutes ago   1.59GB
tensorflow/tensorflow                             2.4.1     45872ba1e662   8 months ago    1.57GB
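
To confirm the image also reached the registry, and not only the local Docker cache, you can additionally run something like:

gcloud container images list-tags gcr.io/${PROJECT_ID}/${IMAGE_NAME}

If that returns no tags, the push failed and the pods will never be able to pull the image.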

BR

I tried to run it again. It says my quota is exceeded for this lab!? Can we have that reset?

Hi
This problem has occurred many times in the past. Take a look at this link.
Br

Thanks, it finally worked. Maybe when they reset my lab it cleaned up the environment from an error I made. I did run build/push before, but it is impossible for me to tell what the problem was. Thanks again.

Thank you so much. Today I got the same errors and tried multiple things, but finally your solution worked for me.

I had the exact same problem. I tried to run the lab a few times and it still got stuck on the last step. How did you clean up your environment to get past it?

$ kubectl get pods
NAME                    READY   STATUS   RESTARTS   AGE
multi-worker-worker-0   0/1     Error    0          10m
multi-worker-worker-1   0/1     Error    0          10m
multi-worker-worker-2   0/1     Error    0          10m

The end of the multi-worker-worker-0 log is:

UnavailableError: [Derived]Collective ops is aborted by: cluster check alive failed, /job:worker/replica:0/task:1 is down
The error could be from a previous operation. Restart your program to reset. [Op:CollectiveBcastSend]

Here is the pod log:

student_00_63b150593fcd@cloudshell:~/lab-files (qwiklabs-gcp-02-b7d2dab03213)$ kubectl get pods
NAME                    READY   STATUS   RESTARTS   AGE
multi-worker-worker-0   0/1     Error    0          2m38s
multi-worker-worker-1   0/1     Error    0          2m38s
multi-worker-worker-2   0/1     Error    0          2m38s
student_00_63b150593fcd@cloudshell:~/lab-files (qwiklabs-gcp-02-b7d2dab03213)$ kubectl describe pod multi-worker-worker-0
Name:           multi-worker-worker-0
Namespace:      default
Priority:       0
Node:           gke-cluster-1-default-pool-273a708b-vc11/10.128.0.3
Start Time:     Mon, 11 Oct 2021 03:11:57 +0000
Labels:         controller-name=tf-operator
                group-name=kubeflow.org
                job-name=multi-worker
                job-role=master
                tf-job-name=multi-worker
                tf-replica-index=0
                tf-replica-type=worker
Annotations:    <none>
Status:         Failed
IP:             10.92.0.7
IPs:
  IP:           10.92.0.7
Controlled By:  TFJob/multi-worker
Containers:
  tensorflow:
    Container ID:  containerd://85956ac8f49653aa17ba8ebde8316885d3bb507bcc63ee7964228a3832fbc7ce
    Image:         gcr.io/qwiklabs-gcp-02-b7d2dab03213/mnist-train
    Image ID:      gcr.io/qwiklabs-gcp-02-b7d2dab03213/mnist-train@sha256:09e9753cd4c0c57e0cc58160c978872515a52456d99d30eef38dae413c284142
    Port:          2222/TCP
    Host Port:     0/TCP
    Args:
      --epochs=5
      --steps_per_epoch=100
      --per_worker_batch=64
      --saved_model_path=gs://qwiklabs-gcp-02-b7d2dab03213/saved_model_dir
      --checkpoint_path=gs://qwiklabs-gcp-02-b7d2dab03213/checkpoints
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 11 Oct 2021 03:11:58 +0000
      Finished:     Mon, 11 Oct 2021 03:13:01 +0000
    Ready:          False
    Restart Count:  0
    Environment:
      TF_CONFIG:  {"cluster":{"worker":["multi-worker-worker-0.default.svc:2222","multi-worker-worker-1.default.svc:2222","multi-worker-worker-2.default.svc:2222"]},"task":{"type":"worker","index":0},"environment":"cloud"}
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8dnvk (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-8dnvk:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  3m49s  default-scheduler  Successfully assigned default/multi-worker-worker-0 to gke-cluster-1-default-pool-273a708b-vc11
  Normal  Pulling    3m48s  kubelet            Pulling image "gcr.io/qwiklabs-gcp-02-b7d2dab03213/mnist-train"
  Normal  Pulled     3m48s  kubelet            Successfully pulled image "gcr.io/qwiklabs-gcp-02-b7d2dab03213/mnist-train" in 186.526366ms
  Normal  Created    3m48s  kubelet            Created container tensorflow
  Normal  Started    3m48s  kubelet            Started container tensorflow

I think I found my stupid mistake. It was in tfjob.yaml. I was missing ‘-bucket’ after the project_id.
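
For anyone comparing against the describe output above, the difference is only the bucket suffix in the GCS paths:

# wrong: gs://qwiklabs-gcp-02-b7d2dab03213 is the project id, not the bucket
--saved_model_path=gs://qwiklabs-gcp-02-b7d2dab03213/saved_model_dir
# correct: the lab bucket name adds a '-bucket' suffix
--saved_model_path=gs://qwiklabs-gcp-02-b7d2dab03213-bucket/saved_model_dir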

How do I access the mnist folder? Could you share a screenshot?

When I try to open the C3W3 lab, I receive an error: "Sorry, your quota is exceeded for this lab." How do we solve this problem?