C4W3 - UNet Assignment - Insight needed (accuracy crash !)

Hi,

So some insight needed here. I was able to complete the assignment, 100%.

However, just before wrapping up, even Prof. Ng in the notebook suggests you get ‘amazing’ results at 40 epochs-- So I figured: Let’s try it ! Let’s see what it looks like !

And everything was running fine and dandy with accuracy rates > 97%… Until the last epoch, 40, when suddenly accuracy completely falls off a cliff (~43%).

Anyone have any idea what is happening here ?!?



1 Like

The “falling off a cliff” phenomenon can happen at any point. It’s not predictable. The solution surfaces here are incredibly contorted and you can be cruising along a gradient just fine and then you take a step just a teensy bit too far and you’re Wile E. Coyote stepping off the edge of the cliff. That does seem to happen on this particular dataset more easily than we see in most of the other cases here. Not sure why that is.

There are also some mysteries here in terms of how to get predictable behavior out of the training in TF. A couple of years ago when I first hit this, I tried a bunch of experiments with setting the random seeds in various ways (both at the numpy and TF levels) and still was not able to get reproducible behavior here. I meant to get back to it and do more research, but that hasn’t happened …

3 Likes

@paulinpaloalto Hmmm, interesting.

Going back through the code, the only ‘random’ thing I can think of here is the presence of dropout, though as you say you’d think that would be governed by the random seed.

2 Likes

That’s a good theory, but my hunch is that it’s something deeper than that about how training works in Keras. There must be some source of non-determinism there that you can’t stamp out with random seeds, e.g. maybe the order of the samples in the randomly shuffled minibatches when you use the Dataset class is not governed by the random seeds.
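
(For what it’s worth, the Dataset API does accept an explicit seed for the shuffle, roughly as in the sketch below, using the notebook’s variable names; though as I said above, seeding alone was not enough to give me reproducible runs.)

train_dataset = (
    processed_image_ds
    .cache()
    .shuffle(BUFFER_SIZE, seed=1)  # pin the shuffle order explicitly
    .batch(BATCH_SIZE)
)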

2 Likes

Hello @Nevermnd, @paulinpaloalto,

Use the following code for getting a reproducible model. This reference explains what’s going on with determinism.

import tensorflow as tf

EPOCHS = 40
VAL_SUBSPLITS = 5
BUFFER_SIZE = 500
BATCH_SIZE = 32

# Key additions:
tf.config.experimental.enable_op_determinism()
tf.keras.utils.set_random_seed(1)
# For more on determinism, check out
# https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_op_determinism

train_dataset = (
    processed_image_ds
    .cache()
    .shuffle(BUFFER_SIZE) # as in the original; better (though not needed here) to set the seed parameter explicitly
    .batch(BATCH_SIZE)
)

unet = unet_model((img_height, img_width, num_channels))
unet.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)
model_history = unet.fit(train_dataset, epochs=EPOCHS)

My results:

Epoch 1/40
34/34 [==============================] - 18s 216ms/step - loss: 2.0005 - accuracy: 0.4070
...
Epoch 5/40
34/34 [==============================] - 3s 83ms/step - loss: 0.4040 - accuracy: 0.8793
...
Epoch 40/40
34/34 [==============================] - 3s 82ms/step - loss: 0.3304 - accuracy: 0.9006


Cheers,
Raymond

3 Likes

Raymond to the rescue! Thanks so much for giving us the real story here!

2 Likes

@rmwkwok Dear Raymond,

Yes, thank you very much for clearing up the determinism point.

However, if you don’t mind me still asking, I guess my original question still holds?

From what we’ve seen/learned so far, we all know that if you train too long (grokking aside) you are prone to over-fitting. This still doesn’t quite seem to fully explain the huge drop that happened for me at 40 epochs, and for you at ~26-28.

I mean it seems like this can’t be explained by overfitting, especially as we recover from it (perhaps we hit a local minimum?).

Further, is this the sort of case that warrants ‘early stopping’ (say at 25)? Or would it be better to say that even though the accuracy is lower, going all the way up to 40 might, in the end, generalize better?

2 Likes

Hello Anthony @Nevermnd,

Still holds.

“Overfitting” doesn’t explain drops. It explains the widening of the gap between training and dev scores.

“Early stopping” stops us earlier, so it avoids both the drop and the better score reached after the drop. Avoiding the former is favourable, while missing the latter is not.
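
For reference, this is roughly what that would look like in Keras (just a sketch; the patience value is arbitrary, and since this notebook has no validation split we can only monitor the training accuracy):

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='accuracy',          # only training accuracy is available here
    patience=3,                  # stop after 3 epochs without improvement
    restore_best_weights=True,   # roll back to the best epoch seen so far
)
model_history = unet.fit(train_dataset, epochs=EPOCHS, callbacks=[early_stop])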

Before we move on, I think we need to be comfortable with two facts:

  • the loss surface is a total unknown to us. Gradient descent only walks us along one path in this whole unknown space;

  • the loss surface is co-determined by (i) the loss function, (ii) the model architecture, and (iii) the training data. (i) and (ii) are fixed, but (iii) varies from mini-batch to mini-batch. In other words, at each move of the gradient descent walk, it is walking on a different loss surface, depending on the changing training mini-batch.

I think most people don’t pay enough attention to the 2nd fact.

Let’s focus on the 2nd fact from now on.

We want each mini-batch to be distributed similarly to the whole training set, so that the loss surfaces built from the mini-batches are like the one built from the whole training set.

What happens if a mini-batch is very different from the whole set? Will the loss surface from such a mini-batch lead us somewhere better or worse?
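
One way to get a feel for this (just an illustrative sketch, not code from the notebook; NUM_CLASSES = 23 is my assumption for this dataset) is to compare the class distribution of a few mini-batches with that of the whole training set:

import numpy as np

NUM_CLASSES = 23  # assumed number of mask classes for this dataset

def class_freq(masks):
    # fraction of pixels belonging to each class
    counts = np.bincount(np.asarray(masks).astype(np.int64).ravel(), minlength=NUM_CLASSES)
    return counts / counts.sum()

# distribution over the whole (unbatched) training set
whole = class_freq(np.stack([mask.numpy() for _, mask in processed_image_ds]))

# compare a few shuffled mini-batches against it
for i, (_, masks) in enumerate(train_dataset.take(5)):
    diff = np.abs(class_freq(masks.numpy()) - whole).max()
    print(f"batch {i}: largest class-frequency deviation = {diff:.3f}")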

While we think about this, I will suggest something in my next reply.

Cheers,
Raymond

2 Likes

The best way to investigate this is to look at the distribution of the gradients in each layer over the epochs. It was easier in TF 1.x with TensorBoard, but unfortunately that is not supported in TF 2.x and will require us to implement it ourselves. I don’t have handy code for that, but here should be a pretty good starting point. I strongly recommend you try it out.
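
A rough sketch of such a callback (my own illustration, untested on this assignment; the log directory and the sample_batch variable are arbitrary names) could look like this:

import tensorflow as tf

class GradientHistograms(tf.keras.callbacks.Callback):
    # Log per-layer gradient histograms to TensorBoard at the end of each epoch,
    # computed on one fixed, representative batch.
    def __init__(self, log_dir, sample_batch):
        super().__init__()
        self.writer = tf.summary.create_file_writer(log_dir)
        self.x, self.y = sample_batch
        self.loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

    def on_epoch_end(self, epoch, logs=None):
        with tf.GradientTape() as tape:
            loss = self.loss_fn(self.y, self.model(self.x, training=True))
        grads = tape.gradient(loss, self.model.trainable_variables)
        with self.writer.as_default():
            for var, grad in zip(self.model.trainable_variables, grads):
                if grad is not None:
                    tf.summary.histogram(var.name.replace(':', '_') + '/gradient', grad, step=epoch)

# usage:
# sample_batch = next(iter(train_dataset))
# unet.fit(train_dataset, epochs=EPOCHS,
#          callbacks=[GradientHistograms('logs/gradients', sample_batch)])

Running "tensorboard --logdir logs" then shows how each layer’s gradient distribution evolves over the epochs.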

Another way is to experiment with different batch sizes and seeds, and count, for each batch size, how many seeds show a drop. For example, combining the code below and this code will do, but it will take a pretty long time too.

from itertools import product

for BATCH_SIZE, seed in product(
    [4, 16, 32, 64, 128, 256], # batch sizes
    range(10), # seeds
):
    print(BATCH_SIZE, seed)

    tf.config.experimental.enable_op_determinism()
    tf.keras.utils.set_random_seed(seed)

    ...
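
To count the drops automatically, a small helper like this could inspect each run’s accuracy history (again just a sketch; the 0.2 threshold is an arbitrary definition of a “drop”):

def has_drop(acc_history, threshold=0.2):
    # True if training accuracy ever falls more than `threshold` below
    # the best value seen so far
    best = 0.0
    for acc in acc_history:
        if best - acc > threshold:
            return True
        best = max(best, acc)
    return False

# e.g. has_drop(model_history.history['accuracy'])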

A few points to note:

  • If the experiment shows fewer drops with larger batch sizes, that demonstrates the correlation and supports my theory

  • Observing the gradients with TensorBoard is a better way to understand the problem, and will potentially get you more insights

Cheers,
Raymond

2 Likes

@rmwkwok @paulinpaloalto Thanks both. Let me look over this in the morning and get back if I still have questions.

– Night

1 Like

Results:

Graphs downloadable here: download.zip (518.1 KB)

We don’t observe such a drop with batch sizes of 64 and 128. Note that the dataset has 1060 samples, and the buffer size for shuffling is 500.
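
If you want to rule out the partial shuffle as a factor, a quick variation (just a sketch) is to make the shuffle buffer cover the whole dataset:

train_dataset = (
    processed_image_ds
    .cache()
    .shuffle(1060, seed=1)  # buffer spans the full 1060-sample dataset
    .batch(BATCH_SIZE)
)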

If we want to dig deeper, at some point we would have to go on from here.

Cheers,
Raymond

3 Likes

@rmwkwok Dear Raymond, thanks very much for providing this.

Hmmm, your earlier comment (point 2) makes sense, and your results here are rather interesting.

I mean earlier Prof. Ng gave us some suggested batch sizes (32, 64, 128, 256 or 512) but there was no real indication they might affect test performance.

The same is basically relayed in Yoshua Bengio’s Practical recommendations for gradient-based training of deep architectures (2012).

But obviously as you show it can make a difference.

I wonder if it is also a factor that (at least as far as I can see in the code) we are not using any sort of learning rate/weight decay (?).

I also find it curious (as per your first run above [reposted below], and a small subset of your larger run here) that even when accuracy recovers, it does so at a level that is lower than before the crash. Notably from your more expansive matrix this does not occur in every case:

[accuracy curve from the first run, reposted]

Unfortunately I’ve never used TensorBoard before, so I am not sure what it is capable of. I might take the TensorFlow Specialization after this one, as honestly I have only used TensorFlow so far insofar as it has been covered in this class.

1 Like

I think the reason the batch size makes a notable impact here is that 40 epochs isn’t sufficient to get this system to converge.

It’s rather important to look at all of the data several times before you can say that the system has been trained adequately.

Regarding your 2nd point, I’d venture that at epoch 26, the system was overfit to the portion of data it had seen up to then. When it gets to epoch 40, it’s seen more data and is no longer overfit as badly.

1 Like

For the learning rate decay / lower level questions, you can do some experiments to verify.

For example, you may run 100 epochs and see if the level turns out to be even/better/worse.

What about learning rate decay? Maybe TensorFlow already provides some handy tool for you, and you can quickly try a medium and a strong decay on the case of a smaller batch size (with which you can expect drops more often)?
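
For example, something along these lines (just a sketch; the decay numbers are arbitrary starting points, not tuned for this assignment):

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,  # Adam's default
    decay_steps=34,              # roughly one epoch at batch size 32 (34 steps per epoch above)
    decay_rate=0.96,             # a "medium" decay; try e.g. 0.8 for a stronger one
)
unet.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)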

With experiment results, it is easier for you to further develop the (counter-) arguments.

Cheers,
Raymond

1 Like

@rmwkwok I always appreciate your input-- However, to all present it seems worth noting that, in Prof. Ng’s own notebook, he suggested 40 epochs would ‘turn out great!’ -- And it just didn’t. And this is the only reason I started this inquiry.

I mean if it was my own model and data set, all bets are off.

And, yes, I completely agree with you, we need exact evidence to support our conjectures-- Though, I am still taking the course, so perhaps I should not try to ‘walk and run’ at the same time. :grin:

3 Likes

Hello Anthony @Nevermnd,

No problem! I was just sharing ways for you to verify some of your hypotheses, but they could be tried out at any time.

If the hypothesis is the left foot, then the verification approach is the right foot. I think only with both can we move forward :wink:

From the graphs I shared, we could achieve good performance at or within 40 epochs, and that is my observation. :wink: With these observations, I don’t consider “40” to be any magic number, and this is my conclusion.

That seems to be a number we can challenge. :wink:

Cheers,
Raymond

1 Like

I suggested that the staff implement reproducibility and adjust the statement about “40 epochs” according to a reproducible model, but, obviously, from the graphs, we know that neither 40 nor any updated value of it is special.

I think the idea is that there is a higher chance of getting a better model with more epochs than with the default five rounds of training.

Cheers,
Raymond

2 Likes