Distributed Multi-worker TensorFlow Training on Kubernetes can't work properly

noro · October 7, 2021, 4:38pm

Hi all,

While working on this assignment, I can't get past this step. When I type the command below in the command prompt, nothing happens. I can't figure out what the actual problem is. I hope you can help me solve this issue. Thank you.

Here is the command:
gcloud beta container clusters create $CLUSTER_NAME \
--project=$PROJECT_ID \
--cluster-version=latest \
--machine-type=n1-standard-4 \
--scopes compute-rw,gke-default,storage-rw \
--num-nodes=3

chris.favila · October 7, 2021, 10:35pm

Hi Noro! Welcome to Discourse! Let me give this a shot and will get back to you as soon as I can. Thanks!

noro · October 8, 2021, 7:04am

Thank you, Chris. I am waiting for the solution.

chris.favila · October 8, 2021, 1:52pm

Hi Noro. Sorry it took a while. I will try the lab now.

chris.favila · October 8, 2021, 2:28pm

Hello again! Unfortunately, I could not replicate the issue. The command ran normally and the cluster was created in about 5 to 10 minutes. The screenshot below shows that the environment variables were created and that I was then able to execute the command you mentioned.
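For anyone following along, one way to confirm that the environment variables exist before running the cluster-creation command is sketched below. This is only an illustration: `require_vars` is a hypothetical helper, not something the lab provides, and it assumes a POSIX-style shell such as the bash session in Cloud Shell.

```shell
# Hypothetical helper (not part of the lab): print the name of every
# listed variable that is unset or empty, and return non-zero if any is.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Check the two variables the cluster-creation command relies on.
require_vars CLUSTER_NAME PROJECT_ID || echo "set the variables above before running gcloud"
```

If either name is reported as missing, the variables were probably defined in a different terminal session and need to be exported again.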

The only hiccup I encountered was that Qwiklabs didn’t start the lab in the student account. I had to switch my profile to the email they gave. Please see the troubleshooting tip here for details.

If you retry the lab and the cluster creation still does not go through after 10 minutes, please consult the Qwiklabs agents through their support channels so they can check for possible problems with your setup. The support channels are on the upper right, and the chat option is usually responsive.

Hope these help and you’re able to proceed with the lab.

noro · October 8, 2021, 2:49pm

Thank you so much for your solution. I will follow it. I am so grateful for your quick response. Have a great day…

noro · October 8, 2021, 4:15pm

Sorry to say that there is still no improvement. I waited almost one hour. Can you check my screenshots? Maybe I did something wrong. I am attaching them. I have also already sent my problem to Qwiklabs.

chris.favila · October 8, 2021, 10:17pm

Hi Noro. That is indeed strange. Did you try pressing Enter multiple times? There should be a few warnings after you issue that command. The cursor should not be stuck at --num-nodes=3. Hopefully, Qwiklabs can help.
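One possible cause of a cursor that sits after the last flag, offered as an assumption rather than a diagnosis of this case: if the final line also ends with a backslash, the shell keeps waiting for more input until Enter is pressed again. The sketch below prints the command as a single line instead of running it, so the continuation can be checked safely. `build_create_command` is a hypothetical helper, and `my-cluster` and `my-project` are placeholder values used only when the real variables are not set.

```shell
# Placeholder fallbacks (assumptions) so the sketch runs outside the lab.
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"
PROJECT_ID="${PROJECT_ID:-my-project}"

# Dry run: echo the command rather than executing it.
# Every line except the last ends with a backslash; the last line has
# none, so the shell treats the command as complete.
build_create_command() {
  echo gcloud beta container clusters create "$CLUSTER_NAME" \
    --project="$PROJECT_ID" \
    --cluster-version=latest \
    --machine-type=n1-standard-4 \
    --scopes compute-rw,gke-default,storage-rw \
    --num-nodes=3
}

build_create_command
```

If the printed line looks right, removing the `echo` (or copying the printed line) runs the real command.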

Roshini · February 6, 2022, 2:28pm

Hi, can someone help me with this issue?

Which version should I change it to?

Roshini · February 8, 2022, 6:14am

Hi All,
I am stuck at the TFJob custom resource installation step with the error: unable to recognize "tf-training/tf-job-crds/base": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"

Can someone help me with this error?

PAmerikanos · February 8, 2022, 8:28am

I’m having the same problem.
Tried changing the versions inside the YAML files to v1, but without success.
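
One likely explanation, offered as an assumption rather than a confirmed diagnosis: Kubernetes stopped serving `apiextensions.k8s.io/v1beta1` for CustomResourceDefinition in v1.22, and the cluster in this lab is created with `--cluster-version=latest`, so it may be running a newer release than the manifests were written for. Editing only the `apiVersion` string is usually not enough, because the `v1` API also requires a structural schema under `spec.versions[].schema.openAPIV3Schema`. The sketch below checks which version a cluster serves. `crd_api_status` is a hypothetical helper that reads `kubectl api-versions` output on stdin, and the sample input is made up so the example runs without a cluster.

```shell
# Hypothetical helper: read `kubectl api-versions` output on stdin and
# report whether the old CRD API version is still served.
crd_api_status() {
  if grep -qxF 'apiextensions.k8s.io/v1beta1'; then
    echo "v1beta1 served"
  else
    echo "v1beta1 not served"
  fi
}

# Against a real cluster: kubectl api-versions | crd_api_status
# The sample input below stands in for a cluster that serves only v1.
printf 'apps/v1\napiextensions.k8s.io/v1\n' | crd_api_status
```

If the result is "v1beta1 not served", the CRD manifests would need a full migration to `apiextensions.k8s.io/v1`, or the cluster would need to run an older Kubernetes version.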

MolyMalibu · March 11, 2022, 9:21pm

Hi,

Can someone help me? The labs give me an error 500. These labs need to be corrected; they have too many errors. I cannot finish the labs because of this error 500.

josmansanvil · September 29, 2022, 2:37pm

I have the same issue. When I run the command "kubectl apply --kustomize tf-training/tf-job-crds/base", it returns:

error: resource mapping not found for name: "tfjobs.kubeflow.org" namespace: "" from "tf-training/tf-job-crds/base": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
ensure CRDs are installed first
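
To see which files in the kustomize base still declare the old API version before attempting any migration, something like the following could be used. `find_v1beta1_manifests` is a hypothetical helper, and the demo writes throwaway files to a temporary directory (an assumption made so the sketch runs without the lab repository).

```shell
# Hypothetical helper: list files under a directory that still reference
# the v1beta1 CRD API. Prints nothing when no file matches.
find_v1beta1_manifests() {
  grep -rlF 'apiextensions.k8s.io/v1beta1' "$1" 2>/dev/null || true
}

# In the lab this would be: find_v1beta1_manifests tf-training/tf-job-crds/base
# The demo below uses a temporary directory with made-up file contents.
demo_dir="$(mktemp -d)"
printf 'apiVersion: apiextensions.k8s.io/v1beta1\nkind: CustomResourceDefinition\n' > "$demo_dir/crd.yaml"
printf 'resources:\n- crd.yaml\n' > "$demo_dir/kustomization.yaml"
find_v1beta1_manifests "$demo_dir"
rm -rf "$demo_dir"
```

Only the file containing the deprecated `apiVersion` line is listed, which narrows down what would have to be rewritten.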
