Thank you, Jamal, for the detailed response. You have answered my first question, though I still have some confusion about the second one. Let me try to put it together.

So, in the model summary there are only 3 layers that have parameters: Conv2D, Batch Normalization, and Dense, with 4,736, 128, and 32,769 parameters respectively. Summing these up, we have 4,736 + 128 + 32,769 = 37,633 parameters in total, which we can also see in the model summary as **‘Total params: 37,633’**. Under **‘Total params’** we can also see **‘Trainable params: 37,569’**, and under that, **‘Non-trainable params: 64’**.
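Just to double-check the arithmetic, here is a small sketch that reproduces these counts. The layer shapes are my assumption, one configuration consistent with the summary (a 7x7 Conv2D with 3 input channels and 32 filters, batch norm over 32 channels, and a single-unit Dense layer on 32,768 flattened features):

```python
# Assumed shapes: these are inferred from the parameter counts, not stated
# in the model summary itself.
conv_params = (7 * 7 * 3 + 1) * 32   # (kernel * in_channels + bias) per filter
bn_params = 4 * 32                   # gamma, beta, moving mean, moving variance per channel
dense_params = 32_768 * 1 + 1        # weights + bias for a single-unit Dense layer

total = conv_params + bn_params + dense_params
print(conv_params, bn_params, dense_params, total)  # 4736 128 32769 37633
```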

Now, based on these numbers, we can clearly see that **Non-trainable params = Total params - Trainable params**. From this I’m inferring that the non-trainable params are a subset of the total params, which means they must come from those same three layers: Conv2D, Batch Normalization, and Dense. Please correct me if I’m making the wrong inference, but if I’m right, then my second inference on top of this is as follows.

There are 4 distinct parameters in a batch normalization layer: Gamma, Beta, Moving Mean, and Moving Variance. Among these 4, the Moving Mean and Moving Variance are not trained or learnt; rather, they are recomputed with the same running-average formula after every batch. That implies 2 non-trainable parameters per channel, and with 32 channels that gives **2x32 = 64 non-trainable parameters**, as seen in the model summary. Please correct me if I’m wrong.

For the 3rd question that I asked, I understand that normalization is applied on the 3rd axis, which stores the channel data. Again, please correct me if I’m making the wrong inference about how that is computed. Here’s my intuition on this computation.

Let’s say we get this **6x6x3** output (Z) for **‘i’** training examples, and now we have to apply batch normalization on the 3rd axis of this output. We apply normalization by calculating the mean and variance, using them to calculate the norm, and then using that to calculate **Z(tilde) = γZ(norm) + β**.

So, in order to do that, we will take all 36 of the 1x1x1 red boxes (z) together, across all training examples **‘i’**, and compute one mean, then one variance, then the norm, then z(tilde), to find the normalized values for every red box in every training example. In other words, all the red boxes share a single mean and variance rather than being normalized individually. Then we will do the same for all of the 36 green boxes, and then for all of the blue boxes, across all training examples **‘i’**.
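The steps above can be sketched in numpy. The batch size, the epsilon term (0.001, Keras’s default), and the initial γ = 1, β = 0 are assumptions for illustration; the key point is that normalizing on axis 3 pools the statistics over every other axis, so each colour channel gets exactly one mean and one variance:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 4                                 # assumed number of training examples 'i'
Z = rng.normal(size=(m, 6, 6, 3))     # batch of 6x6x3 outputs

gamma = np.ones(3)                    # one gamma per channel
beta = np.zeros(3)                    # one beta per channel
eps = 1e-3                            # assumed: Keras's default epsilon

# All 36 positions of a channel, across all m examples, share one mean/variance.
mean = Z.mean(axis=(0, 1, 2))         # shape (3,): one value per colour
var = Z.var(axis=(0, 1, 2))
Z_norm = (Z - mean) / np.sqrt(var + eps)
Z_tilde = gamma * Z_norm + beta

print(Z_tilde.shape)  # (4, 6, 6, 3) - same shape as Z
```

After this, each channel of Z_norm has (approximately) zero mean and unit variance across the batch and spatial positions, and γ and β then rescale and shift it.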

So, I’ve tried to explain my intuition for both questions that I asked. Please correct me if my intuition or inference is wrong. Thank you!