Colab workbook for Week 1 "Object Localization" of the Advanced Computer Vision with TensorFlow course doesn't work

In the Week 1 “Image Classification and Object Localization” lab, the code as given does not learn bounding boxes.

Below is some output from the model training. As you can see, the classification_accuracy does improve until Epoch 8, when it collapses back to 10%. The bounding_box_mse INCREASES at every epoch (except the last).

If I look at the actual bounding_box values the model predicts, they are all large negative values, e.g.

[-28.9, -36.1, -25.4, -28.6]

whereas the real (normalized) bounding boxes have values between 0 and 1, e.g.

[0.53, 0.67, 0.72, 0.84]

I wondered whether the problem is that the bounding box output layer needs a sigmoid activation to keep its values between 0 and 1, but changing the layer to

bounding_box_regression_output = tf.keras.layers.Dense(units=4, activation='sigmoid', name='bounding_box')(inputs)

means the model now predicts [1, 1, 1, 1] for every bounding box.
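One plausible explanation for the all-[1, 1, 1, 1] boxes, sketched here in plain Python (this is my reading, not something from the lab): sigmoid saturates once its pre-activations grow large, as the diverging MSE values above suggest they did, and the near-zero gradient then keeps the outputs stuck at 1.

```python
import math

def sigmoid(z):
    """Standard logistic function, as applied by a sigmoid-activated Dense layer."""
    return 1.0 / (1.0 + math.exp(-z))

print(round(sigmoid(2), 4))   # a moderate pre-activation stays informative
print(round(sigmoid(30), 4))  # a large one saturates to 1.0, with gradient ~ 0
```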

There is another minor error: the code as originally written was `units='4'`, but `units` must be an integer, not a string, so the code didn’t run until I changed this. So maybe this is due to a TensorFlow version change that tightened the validation of `units` and also changed a default value for something that now needs to be stated explicitly?
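For reference, here is a minimal sketch of a two-headed model like the lab's (classifier plus bounding-box regressor). The feature extractor is a stand-in, not the lab's actual architecture; the point is the corrected output layer, with `units` as the integer 4 and a linear activation on the bounding-box head.

```python
import tensorflow as tf

# Stand-in feature extractor (the lab's real backbone differs).
inputs = tf.keras.layers.Input(shape=(75, 75, 1))
x = tf.keras.layers.Flatten()(inputs)
x = tf.keras.layers.Dense(64, activation='relu')(x)

# Two output heads: softmax classifier and linear box regressor.
classification_output = tf.keras.layers.Dense(
    units=10, activation='softmax', name='classification')(x)
bounding_box_regression_output = tf.keras.layers.Dense(
    units=4, name='bounding_box')(x)  # units=4 (int), linear activation

model = tf.keras.Model(inputs=inputs,
                       outputs=[classification_output,
                                bounding_box_regression_output])
model.compile(optimizer='adam',
              loss={'classification': 'categorical_crossentropy',
                    'bounding_box': 'mse'})
```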

Epoch 1/10
937/937 ━━━━━━━━━━━━━━━━━━━━ 281s 295ms/step - bounding_box_loss: 2.6578 - bounding_box_mse: 11.3629 - classification_accuracy: 0.1051 - classification_loss: 0.0900 - loss: 2.7478 - val_bounding_box_loss: 2.5046 - val_bounding_box_mse: 216.6360 - val_classification_accuracy: 0.1833 - val_classification_loss: 0.0868 - val_loss: 2.5914
Epoch 2/10
937/937 ━━━━━━━━━━━━━━━━━━━━ 272s 290ms/step - bounding_box_loss: 2.5251 - bounding_box_mse: 295.2394 - classification_accuracy: 0.2509 - classification_loss: 0.0819 - loss: 2.6070 - val_bounding_box_loss: 2.5351 - val_bounding_box_mse: 666.4528 - val_classification_accuracy: 0.4712 - val_classification_loss: 0.0639 - val_loss: 2.5989
Epoch 3/10
937/937 ━━━━━━━━━━━━━━━━━━━━ 271s 290ms/step - bounding_box_loss: 2.5235 - bounding_box_mse: 907.3889 - classification_accuracy: 0.5426 - classification_loss: 0.0583 - loss: 2.5818 - val_bounding_box_loss: 2.5287 - val_bounding_box_mse: 2344.5562 - val_classification_accuracy: 0.6423 - val_classification_loss: 0.0472 - val_loss: 2.5760
Epoch 4/10
937/937 ━━━━━━━━━━━━━━━━━━━━ 281s 300ms/step - bounding_box_loss: 2.5311 - bounding_box_mse: 2069.8621 - classification_accuracy: 0.7006 - classification_loss: 0.0411 - loss: 2.5722 - val_bounding_box_loss: 2.5203 - val_bounding_box_mse: 2233.9570 - val_classification_accuracy: 0.8561 - val_classification_loss: 0.0221 - val_loss: 2.5424
Epoch 5/10
937/937 ━━━━━━━━━━━━━━━━━━━━ 270s 288ms/step - bounding_box_loss: 2.5198 - bounding_box_mse: 2399.3384 - classification_accuracy: 0.8457 - classification_loss: 0.0230 - loss: 2.5429 - val_bounding_box_loss: 2.5179 - val_bounding_box_mse: 2855.4448 - val_classification_accuracy: 0.9006 - val_classification_loss: 0.0153 - val_loss: 2.5332
Epoch 6/10
937/937 ━━━━━━━━━━━━━━━━━━━━ 271s 289ms/step - bounding_box_loss: 2.5241 - bounding_box_mse: 2879.4514 - classification_accuracy: 0.8893 - classification_loss: 0.0168 - loss: 2.5409 - val_bounding_box_loss: 2.5349 - val_bounding_box_mse: 3007.2397 - val_classification_accuracy: 0.9252 - val_classification_loss: 0.0116 - val_loss: 2.5465
Epoch 7/10
937/937 ━━━━━━━━━━━━━━━━━━━━ 271s 289ms/step - bounding_box_loss: 2.5245 - bounding_box_mse: 2894.3176 - classification_accuracy: 0.9136 - classification_loss: 0.0131 - loss: 2.5377 - val_bounding_box_loss: 2.5332 - val_bounding_box_mse: 3362.8108 - val_classification_accuracy: 0.9301 - val_classification_loss: 0.0106 - val_loss: 2.5438
Epoch 8/10
937/937 ━━━━━━━━━━━━━━━━━━━━ 273s 292ms/step - bounding_box_loss: 2.5291 - bounding_box_mse: 3301.3079 - classification_accuracy: 0.9193 - classification_loss: 0.0123 - loss: 2.5414 - val_bounding_box_loss: 2.6231 - val_bounding_box_mse: 22641.6094 - val_classification_accuracy: 0.1009 - val_classification_loss: 0.1798 - val_loss: 2.8030
Epoch 9/10
937/937 ━━━━━━━━━━━━━━━━━━━━ 281s 300ms/step - bounding_box_loss: 2.5757 - bounding_box_mse: 17880.8242 - classification_accuracy: 0.1003 - classification_loss: 0.1799 - loss: 2.7557 - val_bounding_box_loss: 2.5235 - val_bounding_box_mse: 13066.3057 - val_classification_accuracy: 0.1009 - val_classification_loss: 0.1798 - val_loss: 2.7033
Epoch 10/10
937/937 ━━━━━━━━━━━━━━━━━━━━ 277s 296ms/step - bounding_box_loss: 2.5259 - bounding_box_mse: 12436.9219 - classification_accuracy: 0.0983 - classification_loss: 0.1803 - loss: 2.7063 - val_bounding_box_loss: 2.5161 - val_bounding_box_mse: 13073.5303 - val_classification_accuracy: 0.1009 - val_classification_loss: 0.1798 - val_loss: 2.6959

Hello @cavhind123, I just ran that lab and didn’t see that behavior of the loss; perhaps you changed something in the lab you weren’t meant to, so reopening a fresh version could help. You are right about `units`: it needs to be an integer, but there is no need for a sigmoid activation.

@chris.favila there seem to be some deprecations in this lab: `units='4'` no longer works, and if you use the TPU option in Colab, the `tf.tpu.experimental…` call is reported as deprecated! Thank you.

Hmmm. I have run it again and this time Colab is letting me use a TPU (when I ran it before, it only let me use a CPU). And it works: val_bounding_box_loss: 0.0059, val_classification_accuracy: 0.9574, and 43% of my IoUs are now over 0.6.

Weirdly, this time `units='4'` is NOT flagged as an error. A lot of other things changed too: e.g. notice above that the bounding_box_loss wasn’t equal to the bounding_box_mse, which is very odd given that MSE was the loss function! That is now also fixed on the TPU.

So I am guessing some difference between what is executed on the TPU versus the CPU is causing the problem.

But… even now it’s not 100% right; I get an error on this line:

print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])

in <cell line: 9>()
     11 tf.tpu.experimental.initialize_tpu_system(tpu)
     12 strategy = tf.distribute.experimental.TPUStrategy(tpu)  # Going back and forth between TPU and host is expensive. Better to run 128 batches on the TPU before reporting back.
---> 13 print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
     14 elif len(gpus) > 1:
     15   strategy = tf.distribute.MirroredStrategy([gpu.name for gpu in gpus])

KeyError: 'worker'

…just deleting this line means everything works.
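A less destructive fix than deleting the line, sketched with a hypothetical helper (`describe_tpu_workers` is not from the lab): look the key up with `.get()` instead of indexing, since the KeyError suggests the Colab TPU runtime's cluster spec no longer contains a `'worker'` entry.

```python
def describe_tpu_workers(cluster_dict):
    """Build the status message without assuming a 'worker' key exists.

    `cluster_dict` stands in for tpu.cluster_spec().as_dict().
    """
    workers = cluster_dict.get('worker')
    if workers:
        return 'Running on TPU ' + str(workers)
    return 'Running on TPU (no worker entry in cluster spec)'

# In the notebook this would replace the failing print, roughly:
#   print(describe_tpu_workers(tpu.cluster_spec().as_dict()))
```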

Thank you!


Hi Chris. Thank you for reporting and welcome to the Forum! We’ll keep this in mind as we review the notebooks in this course. There might be some Colab updates in the backend that produced this behavior.

@chris.favila I’m getting the same issue as @cavhind123. My notebook is also using a CPU, and I also had to change the `units` argument to an integer. I am also getting bounding boxes that are wildly inaccurate (IoU around 1e-14).
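For context, an IoU around 1e-14 means the predicted box overlaps the ground truth by essentially nothing. A plain-Python IoU (not the lab's utility function; boxes assumed to be normalized [x_min, y_min, x_max, y_max]) shows the computation:

```python
def iou(box_a, box_b):
    """Intersection over union of two [x_min, y_min, x_max, y_max] boxes."""
    # Intersection rectangle corners.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give zero intersection.
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```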

Hi Yan, and thank you for the follow-up! Will look at this today and update the notebook as necessary.

Hi! It seems like the problem was the initialization of the TPU. It’s now refactored to use the same sequence we have in another lab. The bounding boxes behave as expected now. Please reopen the notebook from the classroom to see the changes. Thanks!

Google let me use their TPU this morning and it started working. I tried with CPU as well to test your updates (right after I saw your message), but the bounding box values still didn’t converge. I also tried changing learning rate, adding learning rate decay, changing batch size, adding gradient clipping, and adding higher weights to bounding box loss, but none of those helped on CPU.
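Two of the tweaks mentioned above can be sketched in Keras as follows; this is a hedged illustration, not the lab's code, and the tiny model and the numbers (1e-3, 1.0, 10.0) are placeholders. Gradient clipping goes through the optimizer's `clipnorm`, and the heavier bounding-box weighting through `compile(loss_weights=...)`.

```python
import tensorflow as tf

# Toy two-output model standing in for the lab's architecture.
inputs = tf.keras.layers.Input(shape=(8,))
cls_out = tf.keras.layers.Dense(10, activation='softmax', name='classification')(inputs)
box_out = tf.keras.layers.Dense(4, name='bounding_box')(inputs)
model = tf.keras.Model(inputs, [cls_out, box_out])

# clipnorm=1.0 clips each gradient's global norm (illustrative value).
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
model.compile(optimizer=optimizer,
              loss={'classification': 'categorical_crossentropy',
                    'bounding_box': 'mse'},
              # Weight the box loss 10x the classification loss (illustrative).
              loss_weights={'classification': 1.0, 'bounding_box': 10.0})
```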

From what I remember when I did the specialisation, the Course 3 and Course 4 assignments required me to use the TPU no matter what my model looked like; the probable reason is the size of the dataset being used, or the CPU cores the model is running on.

Also, Colab TPUs work on a 24-hour cycle, so I could usually retrain my models at most twice with the TPU, but not with the CPU.
