Distributed Multi-worker TensorFlow Training on Kubernetes can’t work properly

Hi all,

While doing this assignment, I can’t get past this step. When I type the command below in the command prompt, nothing happens. I can’t figure out what the actual problem is. I hope you can help me solve this issue. Thank you.

Here is the command:
gcloud beta container clusters create $CLUSTER_NAME \
  --project=$PROJECT_ID \
  --cluster-version=latest \
  --machine-type=n1-standard-4 \
  --scopes compute-rw,gke-default,storage-rw \
  --num-nodes=3

Hi Noro! Welcome to Discourse! Let me give this a shot and will get back to you as soon as I can. Thanks!

Thank you Chris, I am waiting for the solution.

Hi Noro. Sorry it took a while. I will try the lab now.

Hello again! Unfortunately, I could not replicate the issue. The command ran normally and finished creating the cluster in about 5 to 10 minutes. The screenshot below shows that the environment variables were created and that I was then able to execute the command you mentioned:

The only hiccup I encountered was that Qwiklabs didn’t start the lab in the student account. I had to switch my profile to the email they gave. Please see the troubleshooting tip here for details.

If you retry the lab and the cluster creation still does not complete after 10 minutes, please consult the Qwiklabs agents through their support channels so they can check for possible problems with your setup. The support channels are on the upper right, and the chat option is usually responsive:

Hope these help and you’re able to proceed with the lab.

Thank you so much for your solution. I will follow it. I am so grateful for your quick response. Have a great day…

Sorry to say that there is still no improvement. I waited almost one hour. Can you check my screenshots? Maybe I did something wrong. I am attaching them. I have already sent my problem to Qwiklabs as well.

Hi Noro. That is indeed strange. Did you try pressing Enter multiple times? There should be a few warnings after you issue that command. The cursor should not be stuck at --num-nodes=3. Hopefully, Qwiklabs can help.
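One thing that may be worth ruling out before retrying is that $CLUSTER_NAME or $PROJECT_ID is empty in the current shell, or that the line continuations were lost when pasting. The following is only a minimal sketch under those assumptions, not the lab's official script; require_vars and create_cluster are hypothetical helper names introduced here for illustration. The gcloud flags are the ones from the original post.

```shell
# Sketch: fail fast if a required variable is empty, then create the cluster.
# require_vars and create_cluster are hypothetical helpers, not part of the lab.

require_vars() {
  missing=0
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

create_cluster() {
  # Stop before calling gcloud if either variable is unset or empty.
  require_vars CLUSTER_NAME PROJECT_ID || return 1
  gcloud beta container clusters create "$CLUSTER_NAME" \
    --project="$PROJECT_ID" \
    --cluster-version=latest \
    --machine-type=n1-standard-4 \
    --scopes compute-rw,gke-default,storage-rw \
    --num-nodes=3
}

# Usage, after exporting CLUSTER_NAME and PROJECT_ID as the lab describes:
#   create_cluster
```

Every line except the last ends with a backslash, and the last line does not, so the shell runs the command as soon as Enter is pressed instead of waiting for more input.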


Hi, can someone help me with this issue?


Which version should I change it to?

Hi All,
I am stuck at the TFJob custom resource installation step with the error: "unable to recognize "tf-training/tf-job-crds/base": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1""


Can someone help me with this error?

I’m having the same problem.
Tried changing the versions inside the YAML files to v1, but without success.

Hi,

Can someone help me? The labs give me error 500. These labs need to be corrected; they have too many errors. I cannot finish the labs because of this error 500.

I have the same issue. When I run the command "kubectl apply --kustomize tf-training/tf-job-crds/base", it returns:

error: resource mapping not found for name: "tfjobs.kubeflow.org" namespace: "" from "tf-training/tf-job-crds/base": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
ensure CRDs are installed first
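A likely cause, for anyone hitting this: Kubernetes 1.22 stopped serving CustomResourceDefinition under apiextensions.k8s.io/v1beta1, and a cluster created with --cluster-version=latest will generally be newer than that. Only apiextensions.k8s.io/v1 remains, and v1 requires each CRD version to declare a structural schema, which may be why simply changing the version string in the YAML to v1 did not work. The sketch below is one way to confirm what the cluster actually serves; crd_api_status is a hypothetical helper name, and it only reads the output of kubectl api-versions on standard input.

```shell
# Sketch: report which CustomResourceDefinition API versions a cluster serves.
# crd_api_status is a hypothetical helper; it reads `kubectl api-versions`
# output on stdin and prints a one-line summary.

crd_api_status() {
  # Keep only the apiextensions group; "|| true" tolerates no matches.
  versions=$(grep '^apiextensions\.k8s\.io/' || true)
  case "$versions" in
    *apiextensions.k8s.io/v1beta1*) echo "v1beta1 served" ;;
    *apiextensions.k8s.io/v1*)      echo "v1 only" ;;
    *)                              echo "no CRD API found" ;;
  esac
}

# Usage against a live cluster:
#   kubectl api-versions | crd_api_status
```

If this prints "v1 only", the v1beta1 manifests in tf-training/tf-job-crds/base cannot be applied as they are; they would need to be replaced with CRDs written for apiextensions.k8s.io/v1, or the cluster would need to run a Kubernetes version older than 1.22 if one is still offered.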