C3W3 assignment TF on Kubernetes: error when pushing the image to the repository

Hello,

In the assignment "Distributed Multi-worker TensorFlow Training on Kubernetes", in the section “Packaging training code in a docker image”, I get the following error message:

"student_03_f21a10da0935@cloudshell:~ (qwiklabs-gcp-00-96520d0335e5)$ IMAGE_NAME=mnist-train
docker build -t gcr.io/${qwiklabs-gcp-00-96520d0335e5}/${IMAGE_NAME} .
docker push gcr.io/${qwiklabs-gcp-00-96520d0335e5}/${IMAGE_NAME}
unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /home/student_03_f21a10da0935/Dockerfile: no such file or directory
Using default tag: latest
The push refers to repository [gcr.io/gcp-00-96520d0335e5/mnist-train]
An image does not exist locally with the tag: gcr.io/gcp-00-96520d0335e5/mnist-train
student_03_f21a10da0935@cloudshell:~ (qwiklabs-gcp-00-96520d0335e5)$ ^C
student_03_f21a10da0935@cloudshell:~ (qwiklabs-gcp-00-96520d0335e5)$ cd
SRC_REPO=https://github.com/GoogleCloudPlatform/mlops-on-gcp
kpt pkg get $SRC_REPO/workshops/mlep-qwiklabs/distributed-training-gke lab-files
cd lab-files
Package "distributed-training-gke":
Fetching https://github.com/GoogleCloudPlatform/mlops-on-gcp@master
From https://github.com/GoogleCloudPlatform/mlops-on-gcp
 * branch            master     -> FETCH_HEAD
Adding package "workshops/mlep-qwiklabs/distributed-training-gke".

Fetched 1 package(s).
student_03_f21a10da0935@cloudshell:~/lab-files (qwiklabs-gcp-00-96520d0335e5)$     strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
    task_type = strategy.cluster_resolver.task_type
    task_id = strategy.cluster_resolver.task_id
    global_batch_size = per_worker_batch * strategy.num_replicas_in_sync
-bash: syntax error near unexpected token `('
-bash: task_type: command not found
-bash: task_id: command not found
-bash: global_batch_size: command not found
student_03_f21a10da0935@cloudshell:~/lab-files (qwiklabs-gcp-00-96520d0335e5)$ IMAGE_NAME=mnist-train
docker build -t gcr.io/${qwiklabs-gcp-00-96520d0335e5}/${IMAGE_NAME} .
docker push gcr.io/${qwiklabs-gcp-00-96520d0335e5}/${IMAGE_NAME}
Sending build context to Docker daemon  36.86kB
Step 1/4 : FROM tensorflow/tensorflow:2.4.1
 ---> 45872ba1e662
Step 2/4 : RUN pip install tensorflow_datasets
 ---> Using cache
 ---> 20c7821506d1
Step 3/4 : ADD mnist mnist
 ---> Using cache
 ---> 1df15f420abc
Step 4/4 : ENTRYPOINT ["python", "-m", "mnist.main"]
 ---> Using cache
 ---> e11a4aca8915
Successfully built e11a4aca8915
Successfully tagged gcr.io/gcp-00-96520d0335e5/mnist-train:latest
Using default tag: latest
The push refers to repository [gcr.io/gcp-00-96520d0335e5/mnist-train]
db8f5690fd17: Retrying in 18 seconds
c8a860688418: Retrying in 18 seconds
097e070097db: Retrying in 19 seconds
e95ae1c1e1a8: Retrying in 19 seconds
e43210c84711: Retrying in 19 seconds
74dfe3df0c94: Waiting
8e29486d090c: Waiting
76bfe8e7e45c: Waiting
3779360d2582: Waiting
9f10818f1f96: Waiting
27502392e386: Waiting
c95d2191d777: Waiting
"

I completed the steps before "Task 4. Preparing TFJob" without errors and got 70/100 for them.
Is this a problem with the internet connection, or did I make a mistake?

When I try to have the task checked, it says "Please update and submit the TFJob manifest. If already done then wait till the job is succeeded and all pods are in running state."
As far as I understand, I made the changes in "tfjob.yaml" correctly:
"apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: multi-worker
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: tensorflow
            image: gcr.io/**qwiklabs-gcp-00-96520d0335e5**/mnist-train
            args:
            - --epochs=5
            - --steps_per_epoch=100
            - --per_worker_batch=64
            - --saved_model_path=gs://qwiklabs-gcp-01-93af833e6576-bucket/saved_model_dir
            - --checkpoint_path=gs://qwiklabs-gcp-01-93af833e6576-bucket/checkpoints"

Thanks a lot 🙂

BR,
Simon

A couple of things:

  1. Please don’t blindly copy and paste instructions from Qwiklabs into the bash shell. For instance, the trace shows that you pasted Python code into the shell.
  2. The image name is gcr.io/${qwiklabs-gcp-00-96520d0335e5}/${IMAGE_NAME}. Why do you need the additional ** stars in tfjob.yaml?
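
A side note on the image name, in case it helps. In bash, `${name-word}` is the "use a default value" expansion: if no variable called `name` is set, it expands to `word`. So wrapping the literal project ID in `${...}` makes bash look for a variable named `qwiklabs` and, not finding one, substitute the text after the first hyphen. That would explain why the trace shows the push going to `gcr.io/gcp-00-96520d0335e5/mnist-train` rather than to the `qwiklabs-gcp-00-...` project. The trace alone does not confirm that this caused the retries. Below is a minimal sketch of the behaviour; `PROJECT_ID` is a variable name I am assuming here, not something taken from the lab files.

```shell
#!/usr/bin/env bash
# Make sure no variable named `qwiklabs` exists, as in a fresh Cloud Shell.
unset qwiklabs

# ${qwiklabs-gcp-00-96520d0335e5} means: "value of $qwiklabs, or the text
# after the first hyphen if $qwiklabs is unset". The "qwiklabs-" prefix is lost.
wrong_ref="gcr.io/${qwiklabs-gcp-00-96520d0335e5}/mnist-train"
echo "$wrong_ref"   # prints gcr.io/gcp-00-96520d0335e5/mnist-train

# Assumed fix: store the project ID in a variable first, then expand it.
# PROJECT_ID is a hypothetical name chosen for this example.
PROJECT_ID="qwiklabs-gcp-00-96520d0335e5"
IMAGE_NAME="mnist-train"
right_ref="gcr.io/${PROJECT_ID}/${IMAGE_NAME}"
echo "$right_ref"   # prints gcr.io/qwiklabs-gcp-00-96520d0335e5/mnist-train
```

In Cloud Shell, `gcloud config get-value project` prints the active project ID, so the value does not have to be typed by hand.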

I was trying a lot of different combinations of commands, some of which were probably wrong.
Now it is clear that I should not paste Python code into the shell and should not use the stars “*”. Thanks a lot.

@balaji.ambresh: yesterday evening it worked with the default “deployment.yaml”. I think I had connection issues at the time and thought that adapting the file would solve them.
Nevertheless, thanks a lot 🙂