C4W3 - Graded Lab 2 canary release - TensorFlow Serving crash on model load and workaround

Hi!

While completing the C4W3 second graded lab I encountered an error while deploying the ResNet models. The ResNet50 deployment would never become ready: checking the pod status with kubectl get pods showed it in a CrashLoopBackOff state. Using kubectl logs I extracted the log from the failing pod:


kubectl logs image-classifier-resnet50-dc6746f88-sl7p4
Defaulted container "tf-serving" out of: tf-serving, istio-proxy, istio-init (init)
2022-09-12 22:05:37.775855: I tensorflow_serving/model_servers/server.cc:89] Building single TensorFlow model file config:  model_name: image_classifier model_base_path: gs://qwiklabs-gcp-00-0f754a097cf6-bucket/resnet_50
2022-09-12 22:05:37.776153: I tensorflow_serving/model_servers/server_core.cc:465] Adding/updating models.
2022-09-12 22:05:37.776188: I tensorflow_serving/model_servers/server_core.cc:594]  (Re-)adding model: image_classifier
2022-09-12 22:05:38.748740: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: image_classifier version: 1}
2022-09-12 22:05:38.748799: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: image_classifier version: 1}
2022-09-12 22:05:38.748828: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: image_classifier version: 1}
2022-09-12 22:05:38.854230: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:45] Reading SavedModel from: gs://qwiklabs-gcp-00-0f754a097cf6-bucket/resnet_50/1
2022-09-12 22:05:39.110604: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:89] Reading meta graph with tags { serve }
2022-09-12 22:05:39.110682: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:130] Reading SavedModel debug info (if present) from: gs://qwiklabs-gcp-00-0f754a097cf6-bucket/resnet_50/1
2022-09-12 22:05:39.264440: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-09-12 22:05:39.359008: I external/org_tensorflow/tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:354] MLIR V1 optimization pass is not enabled
2022-09-12 22:05:39.373487: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:229] Restoring SavedModel bundle.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
/usr/bin/tf_serving_entrypoint.sh: line 3:     7 Aborted                 (core dumped) tensorflow_model_server --port=8500 --rest_api_port=8501 --model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} "$@"
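
For reference, you can target the serving container explicitly instead of relying on the default noted in the first log line; a minimal sketch, with the pod name taken from kubectl get pods:

kubectl get pods
kubectl logs image-classifier-resnet50-dc6746f88-sl7p4 -c tf-serving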

It seems that the tensorflow/serving:latest image crashes when loading the ResNet50 model. I was able to solve the problem by pinning an older version of tensorflow/serving, in particular 2.8.0 (other versions would probably work as well).

The updated deployment-resnet50.yaml file looks like this:

apiVersion: apps/v1
kind: Deployment
metadata: # kpt-merge: default/image-classifier-resnet50
  name: image-classifier-resnet50
  namespace: default
  labels:
    app: image-classifier
    version: resnet50
spec:
  replicas: 1
  selector:
    matchLabels:
      app: image-classifier
      version: resnet50
  template:
    metadata:
      labels:
        app: image-classifier
        version: resnet50
    spec:
      containers:
      - name: tf-serving
        image: "tensorflow/serving:2.8.0"
        args:
        - "--model_name=$(MODEL_NAME)"
        - "--model_base_path=$(MODEL_PATH)"
        envFrom:
        - configMapRef:
            name: resnet50-configs
        imagePullPolicy: IfNotPresent
        readinessProbe:
          tcpSocket:
            port: 8500
          initialDelaySeconds: 10
          periodSeconds: 5
          failureThreshold: 10
        ports:
        - name: http
          containerPort: 8501
          protocol: TCP
        - name: grpc
          containerPort: 8500
          protocol: TCP
        resources:
          requests:
            cpu: "3"
            memory: 4Gi

The only change is the image tag. Of course, the ResNet101 deployment has the same problem, and the same workaround applies (as sketched below).
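
If the resnet101 deployment has already been applied with the broken image, you can either make the same edit in deployment-resnet101.yaml and re-apply it, or patch the image tag in place. A sketch, assuming the resnet101 deployment and container names mirror the resnet50 ones above:

kubectl set image deployment/image-classifier-resnet101 tf-serving=tensorflow/serving:2.8.0
kubectl get pods -w    # watch until the new pod reports Ready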

Cheers!
Sebastián.


I ran into the same problem, and your answer fixed it.

Cheers

Hi Sebastian! Thank you for sharing this workaround! We’ve actually reported this bug before and we’ll follow up with our partners so it can be fixed. Thanks again!

It looks like something was adjusted so that the version tag is now present on the image. Unfortunately, Task 5 now fails (probably due to the timing-of-change issue mentioned on this thread). But also, it looks like you can ‘pass’ the lab with a 75 or 85 if Task 5 is not complete. So :person_shrugging:

Where can we see the workaround for the YAML file conversion in the C4W3 Graded Lab 2? I am getting an error in the YAML-to-JSON conversion.

Error in C4W3, Task 6, despite clearing the previous steps

“”"

student_01_e8b7ec3e009f@cloudshell:~/tfserving-canary (qwiklabs-gcp-02-6aea0e84c7d4)$ kubectl apply -f tf-serving/deployment-resnet50.yaml
error: error parsing tf-serving/deployment-resnet50.yaml: error converting YAML to JSON: yaml: line 37: did not find expected key

After making these changes, I am still getting an error:

 - name: tf-serving
    image: "tensorflow/serving:2.8.0"
    args:
    - "--model_name=$(MODEL_NAME)"
    - "--model_base_path=$(MODEL_PATH)"
    envFrom:
    - configMapRef:
        name: resnet50-configs
    imagePullPolicy: IfNotPresent
    readinessProbe:
      tcpSocket:
        port: 8500
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 10
    ports:
    - name: http
      containerPort: 8501
      protocol: TCP
    - name: grpc
      containerPort: 8500
      protocol: TCP
    resources:
      requests:
        cpu: "3"
        memory: 4Gi

student_01_ed20a4bdd641@cloudshell:~/tfserving-canary (qwiklabs-gcp-02-ac9e79e3e268)$ kubectl get deployments
No resources found in default namespace.

Hello,

Same problem here, even after setting “tensorflow/serving:2.8.0”.
I think this is terribly disappointing.

@Aanand @pgalilea


There is a typo in the file: a second ‘image’ line. Just remove it.
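
After removing it, you can confirm the file parses before applying it; a client-side dry run catches YAML errors without touching the cluster:

kubectl apply --dry-run=client -f tf-serving/deployment-resnet50.yaml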

Hope it helps. Cheers,


Yes, thanks. I also noticed a lot of indentation errors, which I corrected.


After I followed @Gabriele_Boncoraglio’s post (note you have to do this for both the resnet50 and resnet101 deployment YAML files), I was able to complete the lab without any problems. Make sure to follow the instructions for updating your bucket in both the resnet50 and resnet101 config YAML files.

For Task 9, applying/deploying resnet101 (towards the end of the task):

- Update the configmap-resnet101.yaml file: add your bucket info.

- Apply the config YAML: refer back to Task 5.2 and reuse that code, but make sure to replace resnet50 with resnet101.

- Apply the deployment YAML: refer back to Task 6.2 and reuse that code, but make sure to replace resnet50 with resnet101 (see the sketch after this list).
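
Putting those last two steps together, the apply commands look roughly like this (assuming the resnet101 file names mirror the resnet50 ones used earlier in this thread):

kubectl apply -f tf-serving/configmap-resnet101.yaml
kubectl apply -f tf-serving/deployment-resnet101.yaml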

Good luck everyone - you can do it!

Hi Gabriele! Welcome to the community and thank you for sharing this! We’ve reported this to our partners so they can fix the repo asap. I’m marking this as the solution to make it more visible to other learners. Thanks again!

Hi everyone! The typos in the deployment YAML files have now been fixed! Thank you again for reporting!