Does training=True in the ResNet blocks break the code?

In the ResNets assignment, we hardwire a training=True argument into the batch-normalisation calls in the identity and convolutional blocks. That argument is what lets Keras choose whether batch normalisation operates in training mode or in inference mode.
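
For reference, this is the call-time flag on Keras's BatchNormalization layer (a minimal runnable sketch with a dummy batch, not assignment code):

import tensorflow as tf

bn = tf.keras.layers.BatchNormalization(axis=3)
x = tf.random.normal((4, 8, 8, 3))   # a dummy batch of four 8x8x3 tensors
out_train = bn(x, training=True)     # normalise with this batch's mean/variance; also updates the moving averages
out_infer = bn(x, training=False)    # normalise with the moving averages stored during training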

This is fine during training, where we want batch normalisation to compute its means and variances across each mini-batch.

But in inference mode, are we forcing the model to keep operating in training mode? That would mean predictions do not use the batch-norm averages accumulated during training, but instead compute new statistics from whatever set of inputs is handed to the predict function.

My evidence for this is detailed in this post. The basic point is that, using the model developed in the assignment, calling the predict function on all of X_test and then reading off a specific prediction gives a different answer from calling the predict function on just that specific input. If you comment out the batch-norm lines in the code, the discrepancy disappears.
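
That check, in sketch form (example index 3 is arbitrary; the variable names are mine):

pred_from_all = model.predict(X_test)[3]        # prediction for example 3, read from the full-set predictions
pred_direct = model.predict(X_test[[3]])        # prediction for example 3 on its own
print(np.allclose(pred_from_all, pred_direct))  # False with training=True hardwired into the blocks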

Any thoughts much appreciated.

That’s an interesting point (and shines a stronger light on your other post). I’ll look into it.

Thanks for having a look. Just to add one more piece of the puzzle: the pre-trained model resnet50.h5 in the workshop doesn’t have the same issue. Regardless of whether you predict on one example or the whole X_test set, it returns consistent answers (in contrast to the model built up during the assignment).

Are you able to share the exact code that produced that particular model fit? It might help shine some light on what’s going on under the hood.

Sorry, I don’t know where the pre-trained model came from. I can ask the course staff if necessary.

All indications in the Keras documentation are that the model.predict() process should not trigger any re-training. It should just use the model that’s already been fit.

I doubt that model.predict() will do any training. I have not looked into your report regarding the predictions in your other thread.

I think where “training = False” would be used is if you’re taking a pre-trained model, and doing some additional training on certain layers. That comes up later in C4 or C5 (can’t remember at the moment).

I’m still digging around.

Note that the model is just the definition of the architecture. Training (via the model.fit() method) all happens in the background. The details (backpropagation, weight updates, etc) are all hidden from view. There aren’t a lot of details on what the .predict() method does, but I’m looking at that also.

Essentially, the way I’m looking at model.predict(), it should not invoke any backpropagation or weight updates, so won’t change the model, regardless of what the “training = …” argument is set to.
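
One quick way to test that claim (a sketch of my own, assuming the fitted model, X_test and numpy are in scope):

w_before = [w.numpy().copy() for w in model.trainable_weights]
model.predict(X_test)
w_after = [w.numpy() for w in model.trainable_weights]
# expect True: predict() runs no backprop, so trainable weights are untouched
# (BatchNorm's moving statistics are non-trainable weights, so they can still change)
print(all(np.allclose(a, b) for a, b in zip(w_before, w_after)))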

But the documentation for Keras is extremely thin, so it’s difficult to say.

This page:

… does mention predictions and “training = False”, but I think that reference is just to using the call() method directly for better efficiency on small data sets. It’s a puzzling thing to put in the documentation, though.

I compared the summary of the ResNet50 model that we created and the pretrained model and don’t see any difference in the number of trainable parameters. But then I took a look at the resnet50 function and noticed that it does not pass the training parameter to any of the instances of the identity block or conv block. But the definitions of those functions set training = True. So what I then did was change the definitions of those functions to set the default value of training to False. When I do that, your two predict calls produce the same results.

I think the conclusion is that the batch normalization layers are updating their parameters as you run predict. Note that the parameters of the BatchNorm layers evidently don’t count as “trainable parameters” according to whatever metric the summary method uses, but they evidently are changing. We can conjecture that the pretrained model sets training = False in the declarations of the identity and conv blocks. So now the question is whether it is a bug that the notebook version sets it to True.
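
In code, the change I made amounts to something like this (sketched for the identity block, reusing the notebook’s imports; the conv block is analogous):

def identity_block(X, f, filters, training=False, initializer=random_uniform):
    # identical to the notebook's block, except training now defaults to False
    # and is passed through to every BatchNormalization call
    F1, F2, F3 = filters
    X_shortcut = X
    X = Conv2D(filters=F1, kernel_size=1, strides=(1, 1), padding='valid', kernel_initializer=initializer(seed=0))(X)
    X = BatchNormalization(axis=3)(X, training=training)  # no longer hardwired to True
    X = Activation('relu')(X)
    X = Conv2D(filters=F2, kernel_size=f, strides=(1, 1), padding='same', kernel_initializer=initializer(seed=0))(X)
    X = BatchNormalization(axis=3)(X, training=training)
    X = Activation('relu')(X)
    X = Conv2D(filters=F3, kernel_size=1, strides=(1, 1), padding='valid', kernel_initializer=initializer(seed=0))(X)
    X = BatchNormalization(axis=3)(X, training=training)
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)
    return X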

That seems incredibly wacky. I’m struggling to find a reason why that would be a feature rather than a bug in Keras.

And thanks for looking into it.

But the Keras BatchNorm code just did what was asked of it, right? The problem is our code passed the default value of training = True, because it wasn’t overridden on the calls to identity block and convolutional block. I guess maybe the zany part of it is that the training of BatchNorm layers is a whole separate mechanism from the usual back prop stuff. Or maybe all this really says is that I do not “grok” BatchNormalization. A charge to which I plead guilty without hesitation. :nerd_face:

Seems like “applying normalization” is not inherently part of a “training” configuration. But the assignment code treats it that way.

Right, but Batch Normalization is a bit different from what we usually think of as “normalization”. It’s using the mean and variance of the data at each neuron to do the normalization, so the question is whether you freeze those normalization coefficients based on the data you saw in training (training = False) or whether you adjust them dynamically based on the data you’re actually seeing (training = True).
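
To make that concrete, here is a minimal numpy sketch (my own illustration, for a single Dense layer’s pre-activations z of shape (batch, units)) of what the flag switches between:

import numpy as np

def batch_norm(z, gamma, beta, moving_mean, moving_var, training, eps=1e-3):
    if training:   # training=True: use the statistics of the batch you were just handed
        mean, var = z.mean(axis=0), z.var(axis=0)
    else:          # training=False: use the averages frozen at the end of training
        mean, var = moving_mean, moving_var
    z_hat = (z - mean) / np.sqrt(var + eps)  # normalise
    return gamma * z_hat + beta              # scale and shift (learnable parameters)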

The lectures from Prof Ng about BatchNorm are back in Course 2 Week 3. At that point we’re right at the hairy edge of switching from writing things ourselves in python to using TF, and BatchNorm sits right at that boundary. I’ll have to go back and watch those lectures again, but I remember them being a little unsatisfying. He doesn’t really show how the learning and back prop work, because we never actually have to build that “by hand”.

I agree with @paulinpaloalto - batch-norm has to save the running means and variances it accumulates during training and then apply these statically during inference. It makes no sense to keep varying the batch-norm statistics after training, because - as we’ve seen here - you end up with a predict function that gives different predictions depending on how many examples you hand it.

I’ve gone ahead and removed the training=True option from all the identity_blocks and convolutional_blocks, and it behaves as you’d expect, giving the same prediction for the same example regardless of whether you predict on just X_train[3] or return all predictions and then select the third one.

One slightly unexpected side effect I’ve learnt from all this: when running the code with training=True hardwired (i.e. the code for the assignment), you can run for 10 epochs and get pretty good train and test accuracies (using the code below to compute these).

prediction_train = model.predict(X_train)
print("Train accuracy = ", np.mean( np.argmax(prediction_train, axis=1) == np.argmax(Y_train, axis=1)))

prediction_test = model.predict(X_test)
print("Test accuracy = ", np.mean( np.argmax(prediction_test, axis=1) == np.argmax(Y_test, axis=1)))

Running this gives an output of:

Epoch 10/10
34/34 [==============================] - 1s 24ms/step - loss: 0.0867 - accuracy: 0.9741
Train accuracy = 0.9833333333333333
Test accuracy = 0.9166666666666666

But if you run the (I think) corrected code with all the training=True calls removed, you have to train for longer. If you train for only 10 epochs, you end up with apparently good accuracies from the model.fit() output but terrible accuracies reported afterwards - see below the output of the 10th epoch’s final batch, followed by the results of running the above code straight after.

Epoch 10/10
34/34 [==============================] - 1s 23ms/step - loss: 0.2694 - accuracy: 0.9241
Train accuracy = 0.30925925925925923
Test accuracy = 0.2916666666666667

You have to run this version for longer - e.g. 20 epochs - to get a good training (and test) accuracy.

My intuition for this pretty confusing behaviour: in the model run in class, because training is switched on all the time in the batch-norm calls, when I calculate train accuracy straight after the model has been fit, I’m using the most recently updated batch-norm statistics - which is why I get a result (98%) really similar to the last model.fit() number (97%).

But batch-norm statistics should not keep updating - they should be averages built up over the whole training process. My hunch is that, with training=True removed (i.e. running what I think is the correct code), the running means/variances haven’t yet settled down after 10 epochs - meaning that even though the mini-batch accuracy looks good, we haven’t actually finished fitting the model. Training for longer and evaluating accuracy on the full training set (not the accuracy of the last mini-batch) fixes this. You can also pull the batch size down to 16, run for 10 epochs, and get a stable result.
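
That hunch fits with how Keras maintains these statistics: they are exponential moving averages, updated after every batch as

\mu_{\text{moving}} \leftarrow m \, \mu_{\text{moving}} + (1 - m) \, \mu_{\text{batch}}

(and likewise for the variance), where the momentum m defaults to 0.99 in tf.keras.layers.BatchNormalization. Ten epochs of 34 batches is only 340 updates, and 0.99^{340} \approx 0.03, so a residue of the initial values still lingers. More epochs - or a smaller batch size, which means more updates per epoch - gives the averages time to settle.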

If you’ve followed all that, well done - I’ve been puzzling over this for days now! Below I’m posting a full MWE of what I think the code behind the saved resnet50.h5 model in the lectures should look like. Let me know what you think - interested to hear your responses and thoughts on this.

MWE - possible corrected resnet50.h5 model code.

import tensorflow as tf
import numpy as np
import scipy.misc
from tensorflow.keras.applications.resnet_v2 import ResNet50V2
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet_v2 import preprocess_input, decode_predictions
from tensorflow.keras import layers
from tensorflow.keras.layers import Input, Add, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D, AveragePooling2D, MaxPooling2D, GlobalMaxPooling2D
from tensorflow.keras.models import Model, load_model
from resnets_utils import *
from tensorflow.keras.initializers import random_uniform, glorot_uniform, constant, identity
from tensorflow.python.framework.ops import EagerTensor
from matplotlib.pyplot import imshow

from test_utils import summary, comparator
import public_tests


def identity_block_c(X, f, filters, initializer=random_uniform):
    """
    Implementation of the identity block as defined in Figure 4
    
    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    initializer -- to set up the initial weights of a layer. Equals to random uniform initializer
    
    Returns:
    X -- output of the identity block, tensor of shape (m, n_H, n_W, n_C)
    """
    
    # Retrieve Filters
    F1, F2, F3 = filters
    
    # Save the input value. You'll need this later to add back to the main path. 
    X_shortcut = X
    
    # First component of main path
    X = Conv2D(filters = F1, kernel_size = 1, strides = (1,1), padding = 'valid', kernel_initializer = initializer(seed=0))(X)
    X = BatchNormalization(axis = 3)(X) # Default axis
    X = Activation('relu')(X)
    
    ### START CODE HERE
    ## Second component of main path (≈3 lines)
    X = Conv2D(filters = F2, kernel_size = f, strides = (1,1), padding = 'same', kernel_initializer = initializer(seed=0))(X)
    X = BatchNormalization(axis = 3)(X) # Default axis
    X = Activation('relu')(X)

    ## Third component of main path (≈2 lines)
    X = Conv2D(filters = F3, kernel_size = 1, strides = (1,1), padding = 'valid', kernel_initializer = initializer(seed=0))(X)
    X = BatchNormalization(axis = 3)(X) # Default axis
    
    ## Final step: Add shortcut value to main path, and pass it through a RELU activation (≈2 lines)
    X = tf.keras.layers.Add()([X, X_shortcut])
    X = Activation('relu')(X) 
    ### END CODE HERE

    return X


def convolutional_block_c(X, f, filters, s = 2, initializer=glorot_uniform):
    """
    Implementation of the convolutional block as defined in Figure 4
    
    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    s -- Integer, specifying the stride to be used
    initializer -- to set up the initial weights of a layer. Equals to Glorot uniform initializer, 
                   also called Xavier uniform initializer.
    
    Returns:
    X -- output of the convolutional block, tensor of shape (m, n_H, n_W, n_C)
    """
    
    # Retrieve Filters
    F1, F2, F3 = filters
    
    # Save the input value
    X_shortcut = X


    ##### MAIN PATH #####
    
    # First component of main path glorot_uniform(seed=0)
    X = Conv2D(filters = F1, kernel_size = 1, strides = (s, s), padding='valid', kernel_initializer = initializer(seed=0))(X)
    X = BatchNormalization(axis = 3)(X)
    X = Activation('relu')(X)

    ### START CODE HERE
    
    ## Second component of main path (≈3 lines)
    X = Conv2D(filters = F2, kernel_size = f, strides = (1, 1), padding='same', kernel_initializer = initializer(seed=0))(X)
    X = BatchNormalization(axis = 3)(X)
    X = Activation('relu')(X) 

    ## Third component of main path (≈2 lines)
    X = Conv2D(filters = F3, kernel_size = 1, strides = (1, 1), padding='valid', kernel_initializer = initializer(seed=0))(X)
    X = BatchNormalization(axis = 3)(X)
    
    ##### SHORTCUT PATH ##### (≈2 lines)
    X_shortcut = Conv2D(filters = F3, kernel_size = 1, strides = (s, s), padding='valid', kernel_initializer = initializer(seed=0))(X_shortcut)
    X_shortcut = BatchNormalization(axis = 3)(X_shortcut)

    
    ### END CODE HERE

    # Final step: Add shortcut value to main path (Use this order [X, X_shortcut]), and pass it through a RELU activation
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)
    
    return X



def ResNet50_c(input_shape = (64, 64, 3), classes = 6):
    """
    Stage-wise implementation of the architecture of the popular ResNet50:
    CONV2D -> BATCHNORM -> RELU -> MAXPOOL -> CONVBLOCK -> IDBLOCK*2 -> CONVBLOCK -> IDBLOCK*3
    -> CONVBLOCK -> IDBLOCK*5 -> CONVBLOCK -> IDBLOCK*2 -> AVGPOOL -> FLATTEN -> DENSE 

    Arguments:
    input_shape -- shape of the images of the dataset
    classes -- integer, number of classes

    Returns:
    model -- a Model() instance in Keras
    """
    
    # Define the input as a tensor with shape input_shape
    X_input = Input(input_shape)

    
    # Zero-Padding
    X = ZeroPadding2D((3, 3))(X_input)
    
    # Stage 1
    X = Conv2D(64, (7, 7), strides = (2, 2), kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3)(X)
    X = Activation('relu')(X)
    X = MaxPooling2D((3, 3), strides=(2, 2))(X)

    # Stage 2
    X = convolutional_block_c(X, f = 3, filters = [64, 64, 256], s = 1)
    X = identity_block_c(X, 3, [64, 64, 256])
    X = identity_block_c(X, 3, [64, 64, 256])

    ### START CODE HERE
    
    ## Stage 3 (≈4 lines)
    X = convolutional_block_c(X, f = 3, filters = [128, 128, 512], s = 2) 
    X = identity_block_c(X, 3, [128, 128, 512]) 
    X = identity_block_c(X, 3, [128, 128, 512]) 
    X = identity_block_c(X, 3, [128, 128, 512])  
    
    ## Stage 4 (≈6 lines)
    X = convolutional_block_c(X, f = 3, filters = [256, 256, 1024], s = 2)  
    X = identity_block_c(X, 3, [256, 256, 1024])  
    X = identity_block_c(X, 3, [256, 256, 1024])  
    X = identity_block_c(X, 3, [256, 256, 1024])  
    X = identity_block_c(X, 3, [256, 256, 1024])  
    X = identity_block_c(X, 3, [256, 256, 1024])  

    ## Stage 5 (≈3 lines)
    X = convolutional_block_c(X, f = 3, filters = [512, 512, 2048], s = 2)  
    X = identity_block_c(X, 3, [512, 512, 2048])  
    X = identity_block_c(X, 3, [512, 512, 2048])   

    ## AVGPOOL (≈1 line). Use "X = AveragePooling2D(...)(X)"
    X = AveragePooling2D((2, 2))(X)
    
    ### END CODE HERE

    # output layer
    X = Flatten()(X)
    X = Dense(classes, activation='softmax', kernel_initializer = glorot_uniform(seed=0))(X)
    
    
    # Create model
    model = Model(inputs = X_input, outputs = X)

    return model



# load data
X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()

# Normalize image vectors
X_train = X_train_orig / 255.
X_test = X_test_orig / 255.

# Convert training and test labels to one hot matrices
Y_train = convert_to_one_hot(Y_train_orig, 6).T
Y_test = convert_to_one_hot(Y_test_orig, 6).T

print ("number of training examples = " + str(X_train.shape[0]))
print ("number of test examples = " + str(X_test.shape[0]))
print ("X_train shape: " + str(X_train.shape))
print ("Y_train shape: " + str(Y_train.shape))
print ("X_test shape: " + str(X_test.shape))
print ("Y_test shape: " + str(Y_test.shape))

print(tf.__version__)




# run model

model = ResNet50_c(input_shape = (64, 64, 3), classes = 6)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])


model.fit(X_train, Y_train, epochs = 20, batch_size = 16, verbose = 2)

# check train accuracy is stable - i.e. similar to final epoch of model.fit()

prediction_train = model.predict(X_train)
print("Train accuracy = ", np.mean( np.argmax(prediction_train, axis=1) == np.argmax(Y_train, axis=1)))

prediction_test = model.predict(X_test)
print("Test accuracy = ", np.mean( np.argmax(prediction_test, axis=1) == np.argmax(Y_test, axis=1)))

# save my model
model.save('SIGNS_resnet_model_20_epochs')

# load back in
# pre_trained_model = tf.keras.models.load_model('SIGNS_resnet_model_20_epochs')

# check that predict() function acts as you'd expect it to
i = 3
prediction_3_direct = model.predict(X_test[[i]])
prediction_3_from_all_preds = model.predict(X_test)[i]

print("Class prediction vector [p(0), p(1), p(2), p(3), p(4), p(5)] = ", prediction_3_direct)
print("Class prediction vector [p(0), p(1), p(2), p(3), p(4), p(5)] = ", prediction_3_from_all_preds)

This long post is my distillation of various discussions across two threads to try and get to the bottom of this. The take-home is: training=True is wrong when hardwired into batch-norm calls, and it produces peculiar behaviour. There’s an MWE at the bottom that illustrates this. The following discusses what the issue is, with some details on how batch normalisation works.

To follow up on this: I just wanted to explain batch-norm (as I understand it) to see concretely what may be happening here.

In regular gradient descent, the activations of layer l are calculated from those of the previous layer l-1 as a^l = g(z^l) = g(W^l a^{l-1} + b^l), where g is the activation function (ReLU or whatever).

In batch-norm, you normalise the z’s. You drop the initial bias term and calculate z^l = W^l a^{l-1}. You then calculate a normalised z, \hat z^l, as:

\hat z^l = \frac{z^l - \mu^l}{\sqrt{(\sigma^2)^l + \epsilon}}.

\mu^l is a vector - one entry per neuron - of the means of z^l taken across the particular mini-batch. (\sigma^2)^l holds the corresponding variances.

This normalised value is then scaled and shifted by two more learnable parameters: \tilde z^l = \gamma^l \hat z^l + \beta^l. Finally, the activations of the layer are a^l = g(\tilde z^l).

I think the key quantities here are the means and variances. These are calculated for each mini-batch. During training, running values of these are stored; during inference, the stored values are used (as these correspond to the statistics learned during training). If you set training=True in the batch-norm layers and hand over just one example, you effectively ‘normalise’ that one example away. In other words, your batch-norms should in theory return just the ‘bias’ term \beta^l, because \hat z^l = 0 since z^l = \mu^l for a single example.
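
Spelled out for a batch of a single example:

z^l = \mu^l \;\Rightarrow\; \hat z^l = \frac{z^l - \mu^l}{\sqrt{(\sigma^2)^l + \epsilon}} = 0 \;\Rightarrow\; \tilde z^l = \gamma^l \cdot 0 + \beta^l = \beta^l,

so the layer outputs a^l = g(\beta^l), regardless of the input.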

I thought this might mean all predictions for any input would be equal, but this is not the case. My guess is that batch-norm applied to convolutional layers is not as simple as I’ve presented above and that (maybe) the statistics are taken over more than just the batch dimension - potentially preventing the normalisation terms from vanishing as I’ve suggested in the previous paragraph. But I’ve never seen the nuts and bolts of a conv-net batch-norm implementation, so I’m really not sure on this point.

The main point is: I’m fairly sure training = true is the issue here and that it shouldn’t be part of the definitions of identity or convolutional blocks.

Edit:

I did a bit more sleuthing and yes, for convolutional layers batch norm keeps one mean and variance per channel (that’s the axis = 3 bit), computed across the batch and the spatial dimensions. There’s a nice explainer here. This is why different inputs give different predictions with this model even when training=True and you only hand the predict() function one example.
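
In tensor terms (a quick sketch; 64x64x3 is the assignment’s input shape):

x = tf.random.normal((1, 64, 64, 3))                   # a batch containing a single image
per_channel_mean = tf.reduce_mean(x, axis=[0, 1, 2])   # shape (3,): one mean per channel
# even with batch size 1, each channel's statistics are computed over 64*64
# spatial positions, so the normalised activations are not identically zero
# the way they are for a Dense layer fed a single example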

To test all of the above I built a basic MWE - see below. It shows that if you build a neural network with batch norm and only densely connected layers, set training=True, and then run predict, all inputs give the same prediction, while how you predict (on one input or on all of X_test) changes the result. Comment out the relevant lines to convert the fully connected net into a conv-net and you will see that training=True no longer makes all predictions equal (because z^l \neq \mu^l any more), but how you predict still matters in this regime: pred(X_{test}[i]) \neq pred(X_{test})[i].

I think I fully understand this now - thanks for everyone’s input. Happy to answer any questions!

import tensorflow as tf
import numpy as np
import scipy.misc
from tensorflow.keras.applications.resnet_v2 import ResNet50V2
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet_v2 import preprocess_input, decode_predictions
from tensorflow.keras import layers
from tensorflow.keras.layers import Input, Add, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D, AveragePooling2D, MaxPooling2D, GlobalMaxPooling2D
from tensorflow.keras.models import Model, load_model
from resnets_utils import *
from tensorflow.keras.initializers import random_uniform, glorot_uniform, constant, identity
from tensorflow.python.framework.ops import EagerTensor
from matplotlib.pyplot import imshow

from test_utils import summary, comparator
import public_tests
    
def ResNet50(input_shape = (64, 64, 3), classes = 6):
    """
    Basic NN

    Arguments:
    input_shape -- shape of the images of the dataset
    classes -- integer, number of classes

    Returns:
    model -- a Model() instance in Keras
    """
    
    # Define the input as a tensor with shape input_shape
    X_input = Input(input_shape)
    
    # to test batch-norm effects with convolutions, use the next line and comment out the following two; to see batch norm on a regular NN, do the opposite
#     X = Conv2D(64, (7, 7), strides = (2, 2), kernel_initializer = glorot_uniform(seed=0))(X_input)
    X = Flatten()(X_input)
    X = Dense(64*7*7)(X)
    
    ## if you include the training=True line, you will see that the predictions are the same for all inputs (e.g. all of i = 0,1,2,...)
    ##    and you will see that the predictions differ depending on whether you hand predict() all of X_test or just X_test[i]
#     X = BatchNormalization()(X)
    X = BatchNormalization()(X, training = True)
    X = Activation('relu')(X)
    
    # to test batch-norm effects with convolutions, use next line; to see batch norm on regular NN comment out
#     X = Flatten()(X)
    X = Dense(classes, activation='softmax', kernel_initializer = glorot_uniform(seed=0))(X)
    
    
    # Create model
    model = Model(inputs = X_input, outputs = X)

    return model


# load data
X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()
X_train = X_train_orig / 255.
X_test = X_test_orig / 255.
Y_train = convert_to_one_hot(Y_train_orig, 6).T
Y_test = convert_to_one_hot(Y_test_orig, 6).T


# build and fit basic NN model
model = ResNet50(input_shape = (64, 64, 3), classes = 6)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, Y_train, epochs = 3, batch_size = 32)



# compare predictions from running on whole test set to specific example only
# key points: with training=True, prediction_i_direct will be the same for all examples
#             (the model loses all input information because the batch-norm mean equals
#             the input for a single example, leaving essentially just the constant beta term)
#             with training=True, how you predict also changes the answer, since the BN means
#             and variances change depending on whether you feed the model all of X_test or just X_test[i]

for i in range(0,3):
    prediction_i_direct = model.predict(X_test[[i]])
    prediction_i_from_all_preds = model.predict(X_test)[i]
    
    print("Pred vector - inference from one example  = ", i, " = ", prediction_i_direct)
    print("Pred vector - inference from all examples = ", i, "= " , prediction_i_from_all_preds)