C4W2 Assignment Issues as of September 2022: Autoscaling TensorFlow model deployments with TF Serving and Kubernetes

I am writing this post while doing the lab, to summarise all the issues I'm running into.
I understand that the learning provider is not keen on investing time and effort to keep the lab up to date. But hey guys, this is not a charity. You're earning money from this.
Instead of constantly updating it, make it robust and future-proof.
You're supposed to teach us best practices, but instead you make silly mistakes that derail us from learning valuable knowledge.

General tips

  • Instead of running kubectl get ... over and over, use kubectl get --watch ...; it updates the output whenever something changes under the hood, e.g.: kubectl get --watch svc image-classifier

Task 2. Creating a GKE cluster

Issue: Project ID is not populated by default.

PROJECT_ID=$(gcloud config get-value project)
ERROR: (gcloud.config.set) The required property [project] is not currently set.
It can be set on a per-command basis by re-running your command with the [--project] flag.

You may set it for your current workspace by running:

  $ gcloud config set project VALUE

or it can be set temporarily by the environment variable [CLOUDSDK_CORE_PROJECT]

Solution:
Copy the GCP Project ID from Qwiklabs and use it in the following command:

gcloud config set project qwiklabs-gcp-00-xxxxxxx

Then retry the failing command from the lab instructions.
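
If you want to catch this early, a small guard like the one below might help. It is only a sketch: require_project_id is a hypothetical helper name (not part of the lab or of gcloud), and the check for the literal "(unset)" is an assumption about what gcloud may report when no project is configured.

```shell
# Hypothetical helper: succeeds only when a usable project ID is passed in.
require_project_id() {
  project_id="$1"
  # Treat an empty value or the "(unset)" placeholder (assumed) as "not configured".
  if [ -z "$project_id" ] || [ "$project_id" = "(unset)" ]; then
    echo "Project ID is not set; run: gcloud config set project <ID from Qwiklabs>" >&2
    return 1
  fi
  echo "Using project: $project_id"
}

# Possible usage in Cloud Shell (commented out so the snippet has no side effects):
# PROJECT_ID=$(gcloud config get-value project 2>/dev/null)
# require_project_id "$PROJECT_ID" || exit 1
```

With a guard like this, the lab commands that depend on PROJECT_ID stop with a clear message instead of failing later with a less obvious error.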

Task 5. Creating TensorFlow Serving deployment

Problem:
No pods are ready

kubectl get deployments --watch
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
image-classifier   0/1     1            0           2m19s

The root cause can be found in the logs:

kubectl logs deploy/image-classifier
2022-09-24 16:11:25.621472: I tensorflow_serving/model_servers/server.cc:89] Building single TensorFlow model file config:  model_name: image_classifier model_base_path: gs://qwiklabs-gcp-00-874de3ad0ee1-bucket/resnet_101
2022-09-24 16:11:25.621835: I tensorflow_serving/model_servers/server_core.cc:465] Adding/updating models.
[...]ce-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-24 16:11:27.219326: I external/org_tensorflow/tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2022-09-24 16:11:27.245991: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:229] Restoring SavedModel bundle.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
/usr/bin/tf_serving_entrypoint.sh: line 3:     7 Aborted                 (core dumped) tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"

Solution (kudos to @Sebarl), see the topic: C4W3 - Graded Lab 2 canary release - Tensorflow serving crash on model load and workaround

Edit tf-serving/deployment.yaml and set the image version to 2.8.0:

apiVersion: apps/v1
kind: Deployment
metadata: # kpt-merge: default/image-classifier
  name: image-classifier
  namespace: default
  labels:
    app: image-classifier
spec:
  replicas: 1
  selector:
    matchLabels:
      app: image-classifier
  template:
    metadata:
      labels:
        app: image-classifier
    spec:
      containers:
      - name: tf-serving
        image: "tensorflow/serving:2.8.0"
[...]
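
If you'd rather not edit the file by hand, the same change could be scripted roughly like this. Treat it as a sketch under assumptions: pin_tf_serving_image is a hypothetical helper name, and it assumes the manifest has an image: line that references tensorflow/serving, with or without an existing tag and with or without quotes.

```shell
# Hypothetical helper: pin the tensorflow/serving image tag in a manifest.
# Usage: pin_tf_serving_image <file> <version>
pin_tf_serving_image() {
  file="$1"
  version="$2"
  # Replace an existing tag (if any) after tensorflow/serving with the given version.
  # A backup copy of the original file is kept as <file>.bak.
  sed -E -i.bak "s|(image:[[:space:]]*\"?tensorflow/serving)(:[^\"[:space:]]*)?|\1:${version}|" "$file"
}

# Possible usage from the lab directory (commented out; the apply step is
# whatever command the lab instructions use to create the deployment):
# pin_tf_serving_image tf-serving/deployment.yaml 2.8.0
# kubectl apply -f tf-serving/deployment.yaml
```

After re-applying the manifest, kubectl get deployments --watch should show whether the pod becomes ready with the pinned image.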

I’ve notified the staff about your topic.
The TensorFlow Serving image version is now 2.8.0 for the lab.