C3 W3 Lab - Distributed Multi-worker TensorFlow Training on Kubernetes

Hi,
I am having trouble progressing in this lab exercise.
Under "Creating a Cloud Storage bucket", I used the commands below (copied from the shell terminal):

student_00_219a375a9c6d@cloudshell:~ (qwiklabs-gcp-02-3dfc0de86241)$ export TFJOB_BUCKET=${qwiklabs-gcp-02-3dfc0de86241}-bucket
gsutil mb gs://${TFJOB_BUCKET}
Creating gs://gcp-02-3dfc0de86241-bucket/…

student_00_219a375a9c6d@cloudshell:~ (qwiklabs-gcp-02-3dfc0de86241)$ gsutil ls
gs://gcp-02-3dfc0de86241-bucket/

But when I clicked “Check my progress”, my progress was not verified (no green tick), and it said “Please create a bucket named ‘qwiklabs-gcp-02-3dfc0de86241-bucket’.”

May I know what I am doing incorrectly?

Unless you’ve defined a variable named qwiklabs-gcp-02-3dfc0de86241, you’re going to fail the grader. See the output of gsutil ls: the bucket name does not start with the project ID, so it is not the name the grader expects.

This goes back to variable substitution.
I recommend using ${DEVSHELL_PROJECT_ID} since it should be defined in the cloud shell.
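
To make the substitution issue concrete, here is a minimal sketch. In bash, ${name-word} expands to word when name is unset, so an expression like ${qwiklabs-gcp-02-3dfc0de86241} is read as the variable qwiklabs with the default value gcp-02-3dfc0de86241, which matches the bucket name shown by gsutil ls above. The fallback project ID and the WRONG_BUCKET variable are assumptions for illustration only; in Cloud Shell, DEVSHELL_PROJECT_ID should normally already be set. The gsutil call is left commented out so the snippet can be run anywhere.

```shell
# Fallback value is only for illustration; Cloud Shell normally sets DEVSHELL_PROJECT_ID.
DEVSHELL_PROJECT_ID="${DEVSHELL_PROJECT_ID:-qwiklabs-gcp-02-3dfc0de86241}"

# ${name-word} means "use word if name is unset". Bash therefore treats this as
# the variable "qwiklabs" with the default "gcp-02-3dfc0de86241".
unset qwiklabs
WRONG_BUCKET="${qwiklabs-gcp-02-3dfc0de86241}-bucket"
echo "$WRONG_BUCKET"   # prints gcp-02-3dfc0de86241-bucket

# Build the bucket name from the project ID variable instead.
export TFJOB_BUCKET="${DEVSHELL_PROJECT_ID}-bucket"
echo "$TFJOB_BUCKET"

# Uncomment inside the lab to actually create the bucket:
# gsutil mb "gs://${TFJOB_BUCKET}"
```

Echoing the variable before running gsutil mb is a cheap way to confirm the name matches what the grader asks for.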

Thank you for your reply.
I wanted to try out your suggestion, but it says that I have exceeded my quota for this lab.
May I know how I should go about continuing this lab?

Thank you

Please try to start the assignment from the course lab page. Contact qwiklabs if you don’t have the option to start the lab.

Hello,

I am stuck at Task 5: Submitting the TFJob.

Here is my YAML file:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: multi-worker
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/qwiklabs-gcp-00-22097e7b67c7/mnist-train
              args:
                - --epochs=5
                - --steps_per_epoch=100
                - --per_worker_batch=64
                - --saved_model_path=gs://qwiklabs-gcp-00-22097e7b67c7-bucket/saved_model_dir
                - --checkpoint_path=gs://qwiklabs-gcp-00-22097e7b67c7-bucket/checkpoints

After applying the changes, I get the following error:

$ kubectl describe tfjob multi-worker
Name:         multi-worker
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         TFJob
Metadata:
  Creation Timestamp:  2023-06-24T21:50:21Z
  Generation:          4
  Resource Version:    26337
  UID:                 57051154-c72f-4861-a591-b67ac0317bbe
Spec:
  Clean Pod Policy:  None
  Tf Replica Specs:
    Worker:
      Replicas:  3
      Template:
        Spec:
          Containers:
            Args:
              --epochs=5
              --steps_per_epoch=100
              --per_worker_batch=64
              --saved_model_path=gs://qwiklabs-gcp-00-22097e7b67c7-bucket/saved_model_dir
              --checkpoint_path=gs://qwiklabs-gcp-00-22097e7b67c7-bucket/checkpoints
            Image:  gcr.io/qwiklabs-gcp-00-22097e7b67c7/mnist-train
            Name:   tensorflow
Status:
  Conditions:
    Last Transition Time:  2023-06-24T21:50:22Z
    Last Update Time:      2023-06-24T21:50:22Z
    Message:               Failed to marshal the object to TFJob; the spec is invalid: failed to marshal the object to TFJob
    Reason:                InvalidTFJobSpec
    Status:                True
    Type:                  Failed
  Replica Statuses:        <nil>
Events:
  Type     Reason            Age   From         Message
  ----     ------            ----  ----         -------
  Warning  InvalidTFJobSpec  28m   tf-operator  Failed to marshal the object to TFJob; the spec is invalid: failed to marshal the object to TFJob

Can someone help me, please?

I just tried the lab with your yaml file. The lab works as expected.

It’s possible that you entered a special character while typing the yaml file. Unfortunately, the log doesn’t provide any detail beyond saying that the job spec is invalid.
Please try again, and contact Qwiklabs help (click the question symbol at the top right of the lab page) if you still can’t deploy the job.
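
If you want to check for a stray character before retrying, something along these lines may help. This is only a sketch: SAMPLE is a hypothetical stand-in for your own yaml file, and the snippet writes a small test file containing a non-breaking space so the check has something to find. It flags any line containing bytes outside printable ASCII and ordinary whitespace, which is one common kind of invisible typo; it won’t catch every possible cause of an invalid spec.

```shell
# SAMPLE stands in for your TFJob yaml file (hypothetical; point it at your own file instead).
SAMPLE="$(mktemp)"
# Two lines; the second has a non-breaking space (bytes 0xC2 0xA0) where a normal space belongs.
printf 'spec:\n  cleanPodPolicy:\302\240None\n' > "$SAMPLE"

# In the C locale, any byte that is neither printable ASCII nor whitespace is matched.
# -n prefixes each matching line with its line number; cut keeps only that number.
BAD_LINES="$(LC_ALL=C grep -n '[^[:print:][:space:]]' "$SAMPLE" | cut -d: -f1)"
echo "Lines with non-ASCII or control characters: ${BAD_LINES}"

# Remove the temporary sample file.
rm -f "$SAMPLE"
```

If nothing is reported for your file, retyping the yaml from the lab instructions is probably the quickest next step.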