C4 W1 Lab2 Convolutional Neural Networks

In the lab, I created a neural network consisting of convolutional layers followed by fully connected dense layers. In the model’s summary I can see that there are a total of 37,633 parameters, which come from the Conv2D, Batch Normalization, and Dense layers. I believe that the parameters coming from the Batch Normalization layer are Gamma and Beta. I have 3 questions:

  1. If the parameters coming from the batch normalization layer are Gamma and Beta (2 params), and we are normalizing over a total of 32 channels, then the total number of parameters from the batch normalization layer should be 2 × 32 = 64. How do we end up with 128 params then?

  2. At the bottom of the image, I can see 64 non-trainable parameters. What are these non-trainable parameters?

  3. In the lab, it says “BatchNormalization: for axis 3”. How is that computed?

Hey @Ammar_Jawed,

Well, it’s a nice question, to be honest. Let’s break your question down: I will share my point of view, and other mentors will surely add more notes if needed so that the whole community benefits.

  1. For the first question: you are correct that during batch normalization you calculate Gamma and Beta, but the layer also keeps two additional parameters, the “moving mean (μ)” and the “moving variance (σ²)”:

    • Moving mean (μ): an estimate of the mean activation value for each feature (channel) over the training data. As the model processes each mini-batch, it computes the mean of the activations within that mini-batch for each channel and folds it into a running estimate.

    • Moving variance (σ²): an estimate of the variance of the activation values for each feature (channel). Like the moving mean, it is updated from the variance computed within each mini-batch during training.
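Putting those four quantities together, here is a plain-Python sketch of what the layer stores (the channel count of 32 comes from the lab’s Conv2D layer; the initial values shown are the usual Keras defaults):

```python
channels = 32  # feature maps produced by the lab's Conv2D layer (assumed)

# Trainable parameters: learned by gradient descent
gamma = [1.0] * channels   # scale (γ), Keras default init: ones
beta  = [0.0] * channels   # shift (β), Keras default init: zeros

# Non-trainable parameters: updated as running averages, not by gradients
moving_mean = [0.0] * channels  # running estimate of μ per channel
moving_var  = [1.0] * channels  # running estimate of σ² per channel

total = len(gamma) + len(beta) + len(moving_mean) + len(moving_var)
non_trainable = len(moving_mean) + len(moving_var)
print(total, non_trainable)  # 128 64
```

So the 128 in the summary is 4 parameter vectors × 32 channels, which also answers why only 64 of them are trainable.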

  2. Coming to your second question: in this model the 64 non-trainable parameters come entirely from the BatchNormalization layer:

    • Moving mean and moving variance: these per-channel statistics are counted among the layer’s parameters, but they are updated with running averages rather than by gradient descent, so Keras reports them as non-trainable. With 32 channels, that gives 2 × 32 = 64.

    • Note that pooling and reshape layers (e.g. MaxPooling2D, Flatten) do have settings such as the pooling window size and strides, but those are fixed hyperparameters of the layer’s configuration, not learned parameters, so they contribute zero to the parameter count.
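As a sketch of how those non-trainable statistics evolve, this is the exponential-moving-average update that batch normalization applies after each mini-batch (the function name is illustrative; the momentum default of 0.99 matches Keras):

```python
def update_moving_stats(moving_mean, moving_var, batch_mean, batch_var,
                        momentum=0.99):
    """EMA update applied to the non-trainable statistics after a mini-batch."""
    new_mean = momentum * moving_mean + (1.0 - momentum) * batch_mean
    new_var  = momentum * moving_var  + (1.0 - momentum) * batch_var
    return new_mean, new_var

# One channel, starting from the Keras defaults (mean 0, variance 1),
# after seeing a mini-batch whose activations have mean 2 and variance 4:
m, v = update_moving_stats(0.0, 1.0, batch_mean=2.0, batch_var=4.0)
print(round(m, 4), round(v, 4))  # 0.02 1.03
```

Because this update involves no gradient, the two statistics are listed as non-trainable even though they change during training.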

  3. The statement “BatchNormalization: for axis 3” means that batch normalization is applied along axis 3 of the input data, i.e. the normalization is performed independently for each channel (feature map). To see why, consider how the axes are laid out in a channels-last CNN:

    • Axis 0: Batch dimension (the number of examples in the batch).

    • Axis 1: Height dimension (the spatial dimension along the vertical axis of the feature maps).

    • Axis 2: Width dimension (the spatial dimension along the horizontal axis of the feature maps).

    • Axis 3: Channel dimension (the depth or number of channels in the feature maps).
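To make the axis convention concrete, here is a small NumPy sketch (the batch and spatial sizes are made up) showing that normalizing “for axis 3” means averaging over every axis except the channel axis:

```python
import numpy as np

# Hypothetical channels-last mini-batch: 8 examples, 4x4 maps, 32 channels
Z = np.random.randn(8, 4, 4, 32)

# "BatchNormalization for axis 3" keeps the channel axis and averages
# over every other axis: batch (0), height (1), and width (2)
mean_per_channel = Z.mean(axis=(0, 1, 2))
var_per_channel = Z.var(axis=(0, 1, 2))

print(mean_per_channel.shape)  # (32,) -> one mean per channel
```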

I hope it’s clearer for you now, and feel free to ask for more clarifications.


Thank you Jamal for the detailed response. You have answered my 1st question, though I still have some confusion about the second one. Let me try to put it together.

So, in the model summary there are only 3 layers that have parameters: Conv2D, Batch Normalization, and Dense, with 4,736, 128, and 32,769 parameters respectively. Summing these gives 4,736 + 128 + 32,769 = 37,633 parameters in total, which matches ‘Total params: 37,633’ in the model summary. Under ‘Total params’ we can also see ‘Trainable params: 37,569’, and below that ‘Non-trainable params: 64’.
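Those three counts can be reproduced with plain arithmetic. The shapes below (a 7×7 kernel on a 3-channel input, 32 filters, and a single-unit Dense layer fed 32×32×32 flattened features) are inferred from the lab’s architecture, so treat them as assumptions:

```python
# Conv2D: (kernel_h * kernel_w * in_channels) weights per filter, plus a bias
conv_params = (7 * 7 * 3) * 32 + 32        # 4,736

# BatchNormalization: gamma, beta, moving mean, moving variance per channel
bn_params = 4 * 32                         # 128

# Dense: 32*32*32 = 32,768 flattened features feeding 1 unit, plus its bias
dense_params = (32 * 32 * 32) * 1 + 1      # 32,769

total = conv_params + bn_params + dense_params
trainable = total - 2 * 32                 # the moving stats are not trained
print(total, trainable)  # 37633 37569
```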

Now, based on these numbers, we can clearly see that Non-trainable params = Total params − Trainable params. From this I infer that the non-trainable params are a subset of the total params, which means they must come from one of the following layers: Conv2D, Batch Normalization, or Dense. Please correct me if I’m making the wrong inference, but if I’m right, then my second inference on top of this is as follows.

There are a total of 4 distinct parameters per channel in a batch normalization layer: Gamma, Beta, Moving Mean, and Moving Variance. Among these 4, the Moving Mean and Moving Variance are not trained or learnt; rather, they are calculated with the same running-average formula over and over again. This implies we have 2 non-trainable parameters for each of the 32 channels, giving 2 × 32 = 64 non-trainable parameters, as seen in the model summary. Please correct me if I’m wrong.

For the 3rd question that I asked, I understand that normalization is applied along the 3rd axis, which stores the channel data. Again, please correct me if I’m making the wrong inference about how that is computed. Here’s my intuition on this computation.

Let’s say we get this 3x3x3 output (Z) for ‘i’ training examples, and now we have to apply batch normalization along the 3rd axis of this output. We normalize by calculating the mean and variance, using them to compute Z(norm), and then computing Z(tilde) = γZ(norm) + β.

So, in order to do that, we take all of the red-channel values (z), across every spatial position and across all training examples ‘i’, and compute one mean, one variance, z(norm), and z(tilde) for the whole red channel. Then we do the same for all of the green-channel values, and then for all of the blue-channel values. In other words, each channel gets a single mean and variance, shared by every position in that channel and by every training example.
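That per-channel computation can be sketched in NumPy (the batch size and ε below are made-up values): each channel gets one mean and one variance, pooled over the batch axis and both spatial axes, which are then used for Z(norm) and Z(tilde):

```python
import numpy as np

m, eps = 16, 1e-3                 # hypothetical batch size and epsilon
Z = np.random.randn(m, 3, 3, 3)   # the 3x3x3 output for m training examples

# One mean and one variance per channel, pooled over the batch axis and
# both spatial axes -- shared by every position of that channel
mu = Z.mean(axis=(0, 1, 2), keepdims=True)      # shape (1, 1, 1, 3)
sigma2 = Z.var(axis=(0, 1, 2), keepdims=True)   # shape (1, 1, 1, 3)

Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)
gamma, beta = np.ones(3), np.zeros(3)           # learned per-channel scale/shift
Z_tilde = gamma * Z_norm + beta

print(Z_tilde.shape)  # (16, 3, 3, 3)
```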

So, I’ve tried to explain my intuition for both questions that I asked. Please correct me if my intuition or inference is wrong. Thank you!

Hi @Ammar_Jawed,

I’m pleased to see that you’ve grasped the concept now. Your insights are indeed on point. The 64 non-trainable params originate exclusively from the batch normalization layer, as you aptly described in your previous paragraph.

Moreover, your understanding of applying batch normalization along the 3rd axis is accurate.



I am getting one difference in the output compared to the expected answer. The # params for BatchNormalization I am getting is 256 vs the expected 128. I don’t see anything in the Keras tfl.BatchNormalization documentation on what this param means or how to control it. Can you please clarify?

Never mind, I was passing axis = -3 (a typo for axis = 3). Now it is working.
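For anyone who hits the same mismatch: with a channels-last tensor of shape (batch, H, W, C), axis = -3 counts from the end and lands on the height axis, so the layer normalizes over 64 “channels” instead of 32 (the 64×64×32 activation shape here is an assumption based on the lab):

```python
# Channels-last activation shape after the Conv2D layer (assumed from the lab)
input_shape = (64, 64, 32)   # (height, width, channels); batch axis comes first

size_axis_3  = input_shape[2]   # axis  3 of (batch, H, W, C) -> channels = 32
size_axis_m3 = input_shape[0]   # axis -3 of (batch, H, W, C) -> height   = 64

# BatchNormalization stores 4 parameters per normalized unit
print(4 * size_axis_3)    # 128 -> the expected count (axis=3)
print(4 * size_axis_m3)   # 256 -> what the axis=-3 typo produced
```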

Great, thanks for the help.