C3 W3 Lab - Distributed Multi-worker TensorFlow Training on Kubernetes

victorongsh · July 14, 2022, 9:24am

Hi,
I am having trouble progressing through this lab exercise.
Under "Creating a Cloud Storage bucket", I used the commands below (copied from the Cloud Shell terminal):

student_00_219a375a9c6d@cloudshell:~ (qwiklabs-gcp-02-3dfc0de86241)$ export TFJOB_BUCKET=${qwiklabs-gcp-02-3dfc0de86241}-bucket
gsutil mb gs://${TFJOB_BUCKET}
Creating gs://gcp-02-3dfc0de86241-bucket/…

student_00_219a375a9c6d@cloudshell:~ (qwiklabs-gcp-02-3dfc0de86241)$ gsutil ls
gs://gcp-02-3dfc0de86241-bucket/

But when I clicked “Check my progress”, my progress was not verified (no green tick), and it said “Please create a bucket named ‘qwiklabs-gcp-02-3dfc0de86241-bucket’.”

May I know what I am doing incorrectly?

balaji.ambresh · July 14, 2022, 3:06pm

Unless you’ve defined a variable named qwiklabs-gcp-02-3dfc0de86241, you’re going to fail the grader. See the output of gsutil ls: the bucket name starts with gcp-02-3dfc0de86241, not with the full project ID that the grader expects.

This goes back to variable substitution.
I recommend using ${DEVSHELL_PROJECT_ID} since it should be defined in the cloud shell.
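
To make the substitution issue concrete, here is a minimal sketch of what bash does with the two forms. The project ID is the one from the prompt in the post above, and the fallback assignment is only an assumption so the snippet also runs outside Cloud Shell. Bash reads `${name-word}` as "the value of `name`, or `word` if `name` is unset", which is why everything after the first hyphen ended up as the bucket prefix.

```shell
# In Cloud Shell, DEVSHELL_PROJECT_ID should already be set. The fallback
# value below is an assumption that keeps this sketch runnable elsewhere.
DEVSHELL_PROJECT_ID="${DEVSHELL_PROJECT_ID:-qwiklabs-gcp-02-3dfc0de86241}"

# What the original command did: the variable name is just "qwiklabs",
# and "gcp-02-3dfc0de86241" is the default used because it is unset.
unset qwiklabs
WRONG_BUCKET="${qwiklabs-gcp-02-3dfc0de86241}-bucket"
echo "wrong: ${WRONG_BUCKET}"   # wrong: gcp-02-3dfc0de86241-bucket

# Using the predefined variable keeps the full project ID as the prefix.
TFJOB_BUCKET="${DEVSHELL_PROJECT_ID}-bucket"
echo "right: ${TFJOB_BUCKET}"

# Then create the bucket as in the lab:
# gsutil mb "gs://${TFJOB_BUCKET}"
```

Echoing the variable before running gsutil mb is a cheap way to confirm the name matches what the grader asks for.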

victorongsh · July 15, 2022, 9:11am

Thank you for your reply.
I wanted to try out your suggestion, but it says that I have exceeded my quota for this lab.
May I know how I should go about continuing this lab?

Thank you

balaji.ambresh · July 15, 2022, 1:29pm

Please try to start the assignment from the course lab page. Contact qwiklabs if you don’t have the option to start the lab.

Sebastien_SIME · June 24, 2023, 10:34pm

Hello,

I am stuck at step 5: Task 5. Submitting the TFJob

Here is my YAML file:

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: multi-worker
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/qwiklabs-gcp-00-22097e7b67c7/mnist-train
              args:
                - --epochs=5
                - --steps_per_epoch=100
                - --per_worker_batch=64
                - --saved_model_path=gs://qwiklabs-gcp-00-22097e7b67c7-bucket/saved_model_dir
                - --checkpoint_path=gs://qwiklabs-gcp-00-22097e7b67c7-bucket/checkpoints

After applying the changes, I get the following error:

$ kubectl describe tfjob multi-worker
Name:         multi-worker
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         TFJob
Metadata:
  Creation Timestamp:  2023-06-24T21:50:21Z
  Generation:          4
  Resource Version:    26337
  UID:                 57051154-c72f-4861-a591-b67ac0317bbe
Spec:
  Clean Pod Policy:  None
  Tf Replica Specs:
    Worker:
      Replicas:  3
      Template:
        Spec:
          Containers:
            Args:
              --epochs=5
              --steps_per_epoch=100
              --per_worker_batch=64
              --saved_model_path=gs://qwiklabs-gcp-00-22097e7b67c7-bucket/saved_model_dir
              --checkpoint_path=gs://qwiklabs-gcp-00-22097e7b67c7-bucket/checkpoints
            Image:  gcr.io/qwiklabs-gcp-00-22097e7b67c7/mnist-train
            Name:   tensorflow
Status:
  Conditions:
    Last Transition Time:  2023-06-24T21:50:22Z
    Last Update Time:      2023-06-24T21:50:22Z
    Message:               Failed to marshal the object to TFJob; the spec is invalid: failed to marshal the object to TFJob
    Reason:                InvalidTFJobSpec
    Status:                True
    Type:                  Failed
  Replica Statuses:        <nil>
Events:
  Type     Reason            Age   From         Message
  ----     ------            ----  ----         -------
  Warning  InvalidTFJobSpec  28m   tf-operator  Failed to marshal the object to TFJob; the spec is invalid: failed to marshal the object to TFJob

Can someone help me, please?

balaji.ambresh · June 25, 2023, 10:47am

I just tried the lab with your YAML file, and it works as expected.

It’s possible that you entered a special character while typing the YAML file. Unfortunately, the log gives no details beyond saying that the job spec is invalid.
Please try again, and contact qwiklabs help (click the question symbol at the top right of the lab page) if you still can’t deploy the job.
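
If you want to check the file for stray characters before retyping it, something like the sketch below may help. It is only an assumption about what could be wrong, not a confirmed cause of the InvalidTFJobSpec error. The helper name find_suspect_chars and the filename tfjob.yaml are made up for illustration, and the -a flag assumes GNU or BSD grep.

```shell
# find_suspect_chars FILE (hypothetical helper): print every line that
# contains a byte outside printable ASCII. In the C locale this flags tabs,
# carriage returns, non-breaking spaces, smart quotes and similar paste
# artifacts that are easy to miss in an editor.
find_suspect_chars() {
  LC_ALL=C grep -a -n '[^ -~]' "$1"
}

# Demo on a throwaway file: line 2 has a non-breaking space (bytes C2 A0)
# where a normal space should be, so only line 2 is reported.
demo="$(mktemp)"
printf 'kind: TFJob\nname:\302\240multi-worker\n' > "$demo"
find_suspect_chars "$demo"
rm -f "$demo"

# On your own manifest (assumed filename):
# find_suspect_chars tfjob.yaml
```

No output means the file contains only printable ASCII and newlines, in which case the problem is likely somewhere else.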
