How to identify which normalisation & standardisation process is good?

I tried two other types of normalisation & standardisation apart from the one defined in the assignment workbook. Please refer to the outputs for all three below:

  1. Default given in the assignment, i.e., divide each channel value by 255
    a) cost output (screenshot)
    b) cost vs. learning rate (screenshot)

  2. Per channel, using all the data for that channel to normalize and standardize
    a) cost output (screenshot)
    b) cost vs. learning rate (screenshot)

  3. Per image, per channel normalization and standardization
    a) cost output (screenshot)
    b) cost vs. learning rate (screenshot)

As we can see, the cost improves faster and further going from 1 to 2 to 3, but the accuracy doesn’t improve, or even degrades. Can we conclude that the applied normalisation & standardisation is not doing any good, since the test accuracy is not improving and, on top of that, the model is overfitting more, with the train accuracy reaching 100%?

Also, please help me validate whether the normalisation and standardisation is applied correctly before flattening, and please suggest if there is a better way of doing it.

  1. Before flattening the array, normalise & standardise the data channel-wise, considering one channel at a time
train_channel_mean = []
train_channel_std = []
test_channel_mean = []
test_channel_std = []

for i in range(train_set_x_orig.shape[3]):
    train_channel_mean.append(np.mean(train_set_x_orig[:,:,:,i]))
    train_channel_std.append(np.std(train_set_x_orig[:,:,:,i]))
    test_channel_mean.append(np.mean(test_set_x_orig[:,:,:,i]))
    test_channel_std.append(np.std(test_set_x_orig[:,:,:,i]))
    

train_channel_mean = np.array(train_channel_mean).reshape(1,1,1,train_set_x_orig.shape[3])
train_channel_std = np.array(train_channel_std).reshape(1,1,1,train_set_x_orig.shape[3])
test_channel_mean = np.array(test_channel_mean).reshape(1,1,1,test_set_x_orig.shape[3])
test_channel_std = np.array(test_channel_std).reshape(1,1,1,test_set_x_orig.shape[3])


# normalize and standardize the data
train_set_x_orig = (train_set_x_orig - train_channel_mean) / train_channel_std
test_set_x_orig = (test_set_x_orig - test_channel_mean) / test_channel_std

# print(train_set_x_orig[0,:,:,0].shape)
# print(train_channel_mean.shape)
# print(train_set_x_orig.shape)
  2. Per image, per channel-wise normalisation and standardisation
train_x_channel_mean = []
train_x_channel_std = []
test_x_channel_mean = []
test_x_channel_std = []

for train_image_index in range(m_train):
    train_x_channel_mean.append([])
    train_x_channel_std.append([])
    for i in range(train_set_x_orig.shape[3]):
        train_x_channel_mean[-1].append(np.mean(train_set_x_orig[train_image_index,:,:,i]))
        train_x_channel_std[-1].append(np.std(train_set_x_orig[train_image_index,:,:,i]))

for test_image_index in range(m_test):
    test_x_channel_mean.append([])
    test_x_channel_std.append([])
    for i in range(test_set_x_orig.shape[3]):
        test_x_channel_mean[-1].append(np.mean(test_set_x_orig[test_image_index,:,:,i]))
        test_x_channel_std[-1].append(np.std(test_set_x_orig[test_image_index,:,:,i]))

# print(train_x_channel_mean[0])

train_x_channel_mean = np.array(train_x_channel_mean).reshape(m_train,1,1,train_set_x_orig.shape[3])
train_x_channel_std = np.array(train_x_channel_std).reshape(m_train,1,1,train_set_x_orig.shape[3])
test_x_channel_mean = np.array(test_x_channel_mean).reshape(m_test,1,1,test_set_x_orig.shape[3])
test_x_channel_std = np.array(test_x_channel_std).reshape(m_test,1,1,test_set_x_orig.shape[3])

train_set_x_orig = (train_set_x_orig - train_x_channel_mean) / train_x_channel_std
test_set_x_orig = (test_set_x_orig - test_x_channel_mean) / test_x_channel_std

# print(train_set_x_orig[0,:,:,:])
# print(train_x_channel_mean[0,:,:,0].shape)

Apologies for such a long question


Hello, @VivekKapoor,

Approaches 1 & 2 got 70% test accuracy but approach 3 got only 60%, so the first two approaches are better! Approach 1 is the most common, whereas approach 2 is like how we do batch normalization: channel-wise over the whole batch (except that here it is over the whole dataset). Therefore, the first two approaches are both acceptable.

You may cross-check your results against the outputs of my code by replacing my X with your train_set_x_orig. I also included two np.array_equal statements as a quick check of my own code. :wink:

import numpy as np

X = np.arange(2*3*3*3).reshape(2, 3, 3, 3)

# Your approach 2
X_mean = X.mean(axis=(0,1,2), keepdims=True)
X_std = X.std(axis=(0,1,2), keepdims=True)

# Check
print(np.array_equal(X[..., 0].mean(keepdims=True), X_mean[..., 0]))
print(np.array_equal(X[..., 0].std(keepdims=True), X_std[..., 0]))

# Normalize
X_norm = (X - X_mean) / X_std

Cheers,
Raymond

PS1: You may change it to axis=(1,2) for cross-checking your approach 3. My check will need to be changed accordingly or may just be dropped.
PS2: I wanted to show you, with my code, the use of axis, keepdims, ... and np.array_equal. They should be helpful.
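
For example, a minimal sketch of what that axis=(1,2) version for approach 3 could look like (just an illustration, reusing the same dummy X rather than the assignment data):

import numpy as np

X = np.arange(2*3*3*3).reshape(2, 3, 3, 3)

# Per-image, per-channel statistics: average over the height and width axes
# only, keeping the sample and channel axes.
X_mean = X.mean(axis=(1, 2), keepdims=True)   # shape (2, 1, 1, 3)
X_std = X.std(axis=(1, 2), keepdims=True)     # shape (2, 1, 1, 3)

# Each image is normalized with its own per-channel mean and std
X_norm = (X - X_mean) / X_std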


Well, in addition to Raymond’s points, before we say that the chosen method does not have any benefit, the question would be what happens if you do no normalization at all: just use the raw uint8 pixel values from 0 to 255.

Please try that and let us know what happens. Science! :nerd_face:


Hi @paulinpaloalto - Thanks for directing me to think over this point as well. Please find the output for no normalisation; can you please help me interpret the result? I was a bit confused about why the cost comes out as NaN.

Please find my thoughts on the above result:

The cost is given by the expression

J = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(a^{(i)}\right) + \left(1-y^{(i)}\right)\log\left(1-a^{(i)}\right)\right]

and, as we know, A is defined as

A = \sigma(Z) = \frac{1}{1 + e^{-Z}}, \quad Z = W^T X + b

and the range of A is [0, 1] for each input and output pair. So, based on the above two equations, the cost range should be [-209, 209] (given we have 209 train samples). I understand that Z = W^T X + b would overshoot to a very large number because of no normalization, but A would still be small and hence so would the cost, so why is the cost NaN?

And second, why is the test accuracy increasing when the cost is NaN?

Thanks @rmwkwok - I was not very familiar with how axis worked before; I had read about it multiple times, but was quite confused each time :smile:. Now, with your sample code, I understand it, and the code also becomes crisp and readable :slightly_smiling_face:


No, the range of the cost is [0, \infty). The cost is the average of the loss values across the samples: that is the function of the \frac{1}{m} in the cost formula. The individual loss values are all positive, and each one has the same range, because the range of \log(z) on the domain (0, 1] is (-\infty, 0].
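
To spell that out in symbols: for a single sample with prediction a = \sigma(z) \in (0, 1) and label y \in \{0, 1\}, the loss is

\mathcal{L}(a, y) = -\left[\, y \log(a) + (1 - y)\log(1 - a) \,\right] \ge 0

and the cost J = \frac{1}{m}\sum_{i=1}^{m} \mathcal{L}\left(a^{(i)}, y^{(i)}\right) is an average of non-negative numbers, so it also lies in [0, \infty).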

The reason you get NaN for the cost is that at least one of your output z values must have “saturated” the sigmoid function. The output value of sigmoid can never be exactly 0 or 1 in pure mathematical terms, but we are dealing with the finite approximation of \mathbb{R} that is 64-bit floating point here, so the values can round to 0 or 1. It turns out that in 64-bit floats, sigmoid(37) will give you exactly 1.0 as the output. When that happens, the cost becomes NaN because one of the terms of the loss becomes 0 \cdot (-\infty), and that is Not A Number.
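
You can see this with a tiny NumPy experiment (a sketch for illustration, not part of the assignment code):

import numpy as np

z = 37.0                      # large enough to saturate the sigmoid in float64
a = 1 / (1 + np.exp(-z))
print(a)                      # exactly 1.0

y = 1.0
with np.errstate(divide='ignore', invalid='ignore'):
    # y * log(a) is 0, but (1 - y) * log(1 - a) is 0 * (-inf), which is nan
    loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))
print(loss)                   # nan, so the averaged cost J is nan as well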

The gradients are still valid even if the actual J value is not, so gradient descent still works. The only real purpose of computing J directly and printing it is that it gives you a proxy for whether you’re getting reasonable convergence or not. But I’m actually surprised you get convergence in this case. Did you have to manipulate the learning rate? The reason for normalization is that the gradients are so steep with inputs in the range 0 to 255 that it’s really hard to get convergence. When I tried this kind of experiment the first time I hit this question back in maybe 2016, I had to use such a small learning rate to avoid divergence that the results were terrible. It took so many more iterations and never really made much progress. But that was not on this specific dataset: it was one from the original Stanford Machine Learning course.


Thanks @paulinpaloalto for your response. Yes, I realised it was my bad; I made a silly mistake in the range calculation of the log and the cost function :slightly_smiling_face:.

No, I didn’t update the learning rate; I used the default values num_iterations=2000, learning_rate=0.005.

But thanks again for making me run this; it gave me a better understanding of why normalisation is important. Many of us assume it is important, take it for granted, and blindly apply it without ever realising the why of it :slightly_smiling_face:.

I hope I’m not going too far into the weeds here, but there are some more things to say based on the above:

You can add logic to your cost implementation to avoid the NaN issue caused by sigmoid rounding to 1 or 0. Here’s a thread about that.
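
One common way to do that (a minimal sketch of the idea, not the exact code from that thread) is to clip A away from exactly 0 and 1 before taking the log:

import numpy as np

def safe_cost(A, Y, eps=1e-12):
    """Cross-entropy cost; clipping A into [eps, 1 - eps] avoids log(0) and 0 * -inf."""
    m = Y.shape[1]                    # assumes Y has shape (1, m) as in the assignment
    A = np.clip(A, eps, 1 - eps)
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m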

There are also other common methods of normalization that are sometimes used when dealing with RGB images (both are sketched in the code further below):

  1. Instead of just dividing by 255, you can rescale and shift so that the range is [-1, 1], roughly centered on 0.
  2. You could standardize ("mean normalization") so that the result has \mu = 0 and \sigma = 1.

You could add those to your suite of experiments and see the results. But the overall point is that the simplest method of just scaling by 255 gives as good results as any of the others and it is the simplest code to write and also the cheapest in terms of compute cost. So why not go with the simplest and cheapest? :nerd_face:
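
For reference, here is a minimal sketch of what those two variants could look like (using a dummy array as a stand-in for train_set_x_orig; the variable names are just illustrative):

import numpy as np

# Dummy batch of RGB images with uint8-style values in [0, 255]
X = np.random.randint(0, 256, size=(4, 64, 64, 3)).astype(np.float64)

# 1) Rescale and shift from [0, 255] to [-1, 1], roughly centered on 0
X_pm1 = X / 127.5 - 1.0

# 2) Standardize ("mean normalization") so that mu = 0 and sigma = 1
#    (here over the whole set; a per-channel version would use
#    axis=(0, 1, 2), keepdims=True as shown earlier in the thread)
X_standardized = (X - X.mean()) / X.std()

print(X_pm1.min(), X_pm1.max())                      # close to -1 and 1
print(X_standardized.mean(), X_standardized.std())   # about 0 and 1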

A reminder: Any time you modify the magnitude of the training set values, that will impact the gradients, and you’re going to have to adjust the learning rate and the number of iterations.

Oh, sorry, I just realized that this entire exercise has been using the Logistic Regression implementation from DLS C1 W2 A2. I think that is too easy a case to get a realistic picture of the importance of normalization. Gradient Descent is (apparently) a lot more tractable for LR than it is for multilayer neural networks. My suggestion would be to try this exercise again once you complete Course 1 and use the Deep Neural Network Application assignment (W4 A2) as the basis. It literally uses the same input image datasets, but applies both 2 layer and 4 layer neural nets to the same problem.

I thought it was surprising that you got good performance (even with the NaN cost values) in the case that you used the raw unnormalized images. I’ll bet you that it doesn’t work so well when you try that with the 4 layer network. I’d be very surprised if you get convergence at all unless you make changes to the learning rate.