Unable to create Cluster in Course 3 Week 3

gcloud container clusters create $CLUSTER_NAME \
  --project=$PROJECT_ID \
  --cluster-version=1.21 \
  --machine-type=n1-standard-4 \
  --scopes compute-rw,gke-default,storage-rw \
  --num-nodes=3

Default change: VPC-native is the default mode during cluster creation for versions greater than 1.21.0-gke.1500. To create advanced routes based clusters, please pass the --no-enable-ip-alias flag
Default change: During creation of nodepools or autoscaling configuration changes for cluster versions greater than 1.24.1-gke.800 a default location policy is applied. For Spot and PVM it defaults to ANY, and for all other VM kinds a BALANCED policy is used. To change the default values use the --location-policy flag.
Note: Your Pod address range (--cluster-ipv4-cidr) can accommodate at most 1008 node(s).
ERROR: (gcloud.container.clusters.create) ResponseError: code=400, message=No valid versions with the prefix “1.21” found.
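
The 400 error above means that none of the versions GKE currently offers starts with 1.21. A minimal sketch of how one might check this before creating the cluster, assuming `gcloud container get-server-config` is usable from Cloud Shell. The helper names `list_gke_versions` and `has_version_prefix` and the `ZONE` variable are hypothetical and not part of the lab:

```shell
# Print the control-plane versions GKE currently offers in a zone, one per line.
# Assumption: gcloud's "value" format joins list items with ';', so they are
# split onto separate lines here (harmless if they are already separate).
list_gke_versions() {
  gcloud container get-server-config --zone "$1" \
    --format="value(validMasterVersions)" | tr ';' '\n'
}

# Read versions from stdin (one per line) and succeed if any of them equals
# the given prefix or continues it with '.' or '-' (so 1.2 does not match 1.21).
has_version_prefix() {
  prefix="$1"
  while IFS= read -r version; do
    case "$version" in
      "$prefix" | "$prefix".* | "$prefix"-*) return 0 ;;
    esac
  done
  return 1
}

# Usage (ZONE is an assumption, set it to the zone the lab uses):
# list_gke_versions "$ZONE" | has_version_prefix 1.21 || echo "no 1.21.x version offered"
```

If the check fails, the list printed by `list_gke_versions` shows which versions can be passed to `--cluster-version` instead.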

I found a way around it by creating the cluster with “1.27.2-gke.1200”. But then in part 2, step 3, I am unable to run anything.

code: kubectl apply --kustomize tf-training/tf-job-crds/base

error: resource mapping not found for name: "tfjobs.kubeflow.org" namespace: "" from "tf-training/tf-job-crds/base": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
ensure CRDs are installed first

I could not find either error in the forums. Can anyone help with this?

Hello @Pradheepan_Raghavan

Install the CRDs:
kubectl apply -k https://github.com/kubeflow/tf-operator.git/manifests/overlays/v3/namespaced-install/crds

Then try applying the kustomize configuration:
kubectl apply --kustomize tf-training/tf-job-crds/base

student_01_ac5c54e2f912@cloudshell:~ (qwiklabs-gcp-01-9cfdabdeeb86)$ kubectl apply -k https://github.com/kubeflow/tf-operator.git/manifests/overlays/v3/namespaced-install/crds
error: evalsymlink failure on '/tmp/kustomize-1487791782/manifests/overlays/v3/namespaced-install/crds' : lstat /tmp/kustomize-1487791782/manifests/overlays/v3: no such file or directory

The staff have been informed about this for a fix. Please wait.

I am facing the exact same issue.

It is the same error that I got…
#1 on tf-jobs-crds
student_00_badacd81ea56@cloudshell:~ (qwiklabs-gcp-02-fd56e9f35053)$ kubectl apply --kustomize tf-training/tf-job-crds/base

error: resource mapping not found for name: "tfjobs.kubeflow.org" namespace: "" from "tf-training/tf-job-crds/base": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
ensure CRDs are installed first

#2 on TF-job operator
student_00_badacd81ea56@cloudshell:~ (qwiklabs-gcp-02-fd56e9f35053)$ kubectl apply --kustomize tf-training/tf-job-operator/base

serviceaccount/tf-job-dashboard unchanged

serviceaccount/tf-job-operator unchanged

clusterrole.rbac.authorization.k8s.io/kubeflow-tfjobs-admin configured

clusterrole.rbac.authorization.k8s.io/kubeflow-tfjobs-edit unchanged

clusterrole.rbac.authorization.k8s.io/kubeflow-tfjobs-view unchanged

service/tf-job-operator unchanged

deployment.apps/tf-job-operator unchanged

resource mapping not found for name: "tf-job-operator" namespace: "" from "tf-training/tf-job-operator/base": no matches for kind "ClusterRole" in version "rbac.authorization.k8s.io/v1beta1"

ensure CRDs are installed first

resource mapping not found for name: "tf-job-operator" namespace: "" from "tf-training/tf-job-operator/base": no matches for kind "ClusterRoleBinding" in version "rbac.authorization.k8s.io/v1beta1"

This particular lab is broken: Kubernetes version 1.21.x is no longer available on GKE, and the CRD for version 1.21 does not match the CRD for version 1.22 onwards!
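
The mismatch shows up in the errors quoted in this thread: `apiextensions.k8s.io/v1beta1` and `rbac.authorization.k8s.io/v1beta1` are no longer served from Kubernetes 1.22 onwards, so manifests that still use them cannot be applied to a newer cluster. A minimal sketch for finding the affected files in the lab directory. The function name `find_removed_api_versions` is hypothetical, and only the two API groups seen in the errors above are checked:

```shell
# List manifest files under a directory that still declare one of the two
# v1beta1 API versions named in the errors in this thread.
# Exits non-zero (like grep) when no such file is found.
find_removed_api_versions() {
  grep -rlE 'apiVersion:[[:space:]]*"?(apiextensions\.k8s\.io|rbac\.authorization\.k8s\.io)/v1beta1' "$1"
}

# Usage, run from the directory that contains the lab's tf-training folder:
# find_removed_api_versions tf-training
#
# To see what the cluster itself serves for these groups:
# kubectl api-versions | grep -E 'apiextensions|rbac'
```

Changing `v1beta1` to `v1` in the listed files may be enough for the RBAC manifests, but probably not for the CRD, since `apiextensions.k8s.io/v1` also requires a schema for each version of the resource.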

Does not work, getting error:

error: evalsymlink failure on ‘/tmp/kustomize-2810064141/manifests/overlays/v3/namespaced-install/crds’ : lstat /tmp/kustomize-2810064141/manifests/overlays/v3: no such file or directory

This lab is totally broken. I was somehow able to create the operator and CRD and get through that particular step, but the last step, where we need to create a TFJob instance and wait for its status to change to succeeded, does not pass in spite of the TFJob instance status changing to succeeded! I am totally stuck!

Same issue here. I changed the cluster version to 1.22, but when I tried to install the TFJob custom resource, I got an error about versions. I’m completely stuck.

Hi @sanjaypsachdev ,

After getting the same error, I tried to install the latest Kubernetes version. It was something like “1.28.something”. I am getting:

Output: error: resource mapping not found for name: "tfjobs.kubeflow.org" namespace: "" from "tf-training/tf-job-crds/base": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1" ensure CRDs are installed first

Dear @vlady,

I have tried the “latest” keyword in
--cluster-version=latest

Reference: GKE versioning and support | Google Kubernetes Engine (GKE) | Google Cloud

I’ve tried, but I got this error

error: resource mapping not found for name: "tfjobs.kubeflow.org" namespace: "" from "tf-training/tf-job-crds/base": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
ensure CRDs are installed first

Same here, reported in this post:

It does not work; the location https://github.com/kubeflow/tf-operator.git/manifests/overlays/v3/namespaced-install/crds does not exist.

I just reached out to lab support via the provided chat option to ask when the lab will be fixed, and received the following answer: “I have checked there is currently no ongoing issue found.” But the person on the chat then tried to run the lab instance and hit the same issues we did.

Looks like Google does not have a well-working ticketing system for reported issues…

I had this same error, tried using this and now it’s been running for around 10 minutes…

gcloud container clusters create $CLUSTER_NAME \
  --project=$PROJECT_ID \
  --cluster-version=1.27 \
  --machine-type=n1-standard-4 \
  --scopes compute-rw,gke-default,storage-rw \
  --num-nodes=3

I have a feeling I’m going to run out of time before this works

Cluster creation is now taking more than 40 minutes and then erroring out in the Course 4 assignments.

“Instance ‘gke-cluster-1-default-pool-b0afc355-nfvx’ creation failed: The zone ‘projects/qwiklabs-gcp-02-1446351617c5/zones/us-west1-c’ does not have enough resources available to fulfill the request. Try a different zone, or try again later.”>, <StatusCondition canonicalCode: CanonicalCodeValueValuesEnum(UNAVAILABLE, 15)

Update: I was only able to create the cluster with a smaller e2-standard machine type, but then got an error during verification: “Please create a GKE cluster cluster-1 with required configurations.”
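
The error message itself suggests trying a different zone. A rough sketch of that idea, assuming the lab accepts a cluster in another zone (its verification step may insist on a specific zone and machine type, so this might not pass the check). The function names `create_in_first_available_zone` and `create_lab_cluster` are hypothetical, and the zone names in the usage line are examples only:

```shell
# Run a create command once per zone until one succeeds, then print the zone
# that worked. $1 is the name of a command or function that takes a zone as
# its only argument; the remaining arguments are the zones to try, in order.
create_in_first_available_zone() {
  create_cmd="$1"
  shift
  for zone in "$@"; do
    if "$create_cmd" "$zone"; then
      echo "$zone"
      return 0
    fi
    echo "creation failed in $zone, trying next zone" >&2
  done
  return 1
}

# Hypothetical wrapper around the command posted earlier in this thread;
# only the --zone flag is added. CLUSTER_NAME and PROJECT_ID must be set.
create_lab_cluster() {
  gcloud container clusters create "$CLUSTER_NAME" \
    --project="$PROJECT_ID" \
    --zone="$1" \
    --cluster-version=1.27 \
    --machine-type=n1-standard-4 \
    --scopes compute-rw,gke-default,storage-rw \
    --num-nodes=3
}

# Usage (example zones, not taken from the lab instructions):
# create_in_first_available_zone create_lab_cluster us-west1-a us-west1-b
```

A failed attempt may leave a cluster behind in the zone that ran out of resources, so it is worth checking `gcloud container clusters list` afterwards.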

I have a feeling Google is really struggling with resources recently. When I tried to run the “Ungraded Lab - Knowledge Distillation” in Module 3, it threw “runtime crashed due to insufficient RAM memory” almost every time I ran the notebook from start to finish. Many times it did not even allow using a GPU for model training. I had to limit the number of epochs drastically to reach the end of the lab without crashing…

Currently the lab is unavailable… hopefully it will be up soon
