There is no single formula taught in the lectures; you can work it out intuitively, layer by layer. In the expansion layer: (1 × 1 × 3) × 18 weights (18 expansion filters) + 2 × 18 (Batch Normalization's two parameters, gamma and beta, for scaling and shifting).
We do the same for the next layers. In the depthwise conv: (3 × 3) × 18 weights (one 3×3 kernel per channel) + 2 × 18. In the pointwise conv: (1 × 1 × 18) × output channels + 2 × output channels.

(The ReLU layers have no learnable parameters.)

Sometimes only two BNs are used, i.e. after the depthwise conv and the pointwise conv.

Whereas in some transfer-learning models only one BN is used, i.e. after the depthwise conv.

I think in the research paper all three BNs are used, so calculate according to your practical model.

Take the layer order from the block below:

import tensorflow as tf
from tensorflow.keras.layers import (Conv2D, BatchNormalization, Activation,
                                     DepthwiseConv2D, Add)

def bottleneck_block(x, expand=18, squeeze=3):
    # squeeze must equal the channel count of x, or the Add() skip connection fails
    m = Conv2D(expand, (1, 1), use_bias=False)(x)   # bias is redundant before BN
    m = BatchNormalization()(m)
    m = Activation(tf.nn.relu6)(m)   # 'relu6' is not a built-in activation string
    m = DepthwiseConv2D((3, 3), padding='same', use_bias=False)(m)  # keep spatial dims for the skip
    m = BatchNormalization()(m)
    m = Activation(tf.nn.relu6)(m)
    m = Conv2D(squeeze, (1, 1), use_bias=False)(m)
    m = BatchNormalization()(m)
    return Add()([m, x])
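To sanity-check the arithmetic above, here is a small pure-Python helper (my own sketch, not from the lectures) that counts the trainable parameters of one bottleneck block, assuming no conv biases and one 3×3 kernel per input channel in the depthwise layer (depth_multiplier = 1):

```python
def bottleneck_param_count(c_in, expand, c_out):
    """Trainable parameters of one bottleneck block (no conv biases)."""
    expansion = (1 * 1 * c_in) * expand + 2 * expand  # 1x1 conv + BN (gamma, beta)
    depthwise = (3 * 3) * expand + 2 * expand         # one 3x3 kernel per channel + BN
    pointwise = (1 * 1 * expand) * c_out + 2 * c_out  # 1x1 projection + BN
    return expansion + depthwise + pointwise

# 3 input channels, expand to 18, project back to 3 channels
print(bottleneck_param_count(3, 18, 3))  # 90 + 198 + 60 = 348
```

If your practical model drops one of the three BNs, just remove the corresponding `2 * …` term.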

This is my first answer in the community; let me know if it was helpful.

Thanks @tarunsaxena1000 for replying. I have some questions. Why does BatchNorm have 2 parameters? (I kind of forgot what BatchNorm does.) Why is the depthwise conv's second term multiplied by 28? And in the last step, why do we have 2 × output channels?

To be honest, I am still confused about how the whole thing works. Thanks for the code; I didn't know the "expand" step is a Conv2D layer. I guess I will go rewatch the lecture videos and some other videos on YouTube to get a better understanding of MobileNetV2.

Faster Convergence: By reducing internal covariate shift, Batch Normalization allows for faster training by stabilizing and accelerating the learning process.

Regularization: Acts as a regularizer by adding noise to the activations of each layer, which reduces the need for Dropout and helps prevent overfitting.

Improved Gradient Flow: Normalizing the activations keeps the gradients propagated during backpropagation more stable, which leads to faster convergence.

Remember, we studied how slowly gradient descent converges on unnormalized data, where the elongated cost contours force it to zigzag toward the minimum.

Without γ (gamma) and β (beta), the normalized output would always have zero mean and unit variance. This constraint could limit the network's ability to represent complex functions, as it forces all outputs to follow this fixed distribution.

By introducing γ(gamma) and β(beta), the network can learn to adjust the mean and variance of the normalized activations, thus restoring the capacity to represent a wide range of functions.

It should be 18; sorry, that was a typo on my part. Thanks for pointing it out.

Channel-wise Normalization

Batch Normalization normalizes the activations of each channel independently across the mini-batch. Here’s how it works:

For each channel c, BN computes the mean μ_c and standard deviation σ_c across the mini-batch. (These are non-trainable parameters.)

It then scales and shifts the normalized values using learned parameters γ_c and β_c, where γ_c scales the normalized value and β_c shifts it. (These are trainable parameters.)
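As a concrete illustration of those two steps, here is a minimal pure-Python sketch of channel-wise batch norm over a tiny batch (just the idea, not any framework's actual implementation):

```python
def batch_norm(batch, gamma, beta, eps=1e-5):
    """batch: list of examples, each a list of per-channel values."""
    n = len(batch)
    channels = len(batch[0])
    out = [[0.0] * channels for _ in range(n)]
    for c in range(channels):
        vals = [ex[c] for ex in batch]
        mu = sum(vals) / n                              # non-trainable statistic
        var = sum((v - mu) ** 2 for v in vals) / n      # non-trainable statistic
        for i, v in enumerate(vals):
            x_hat = (v - mu) / (var + eps) ** 0.5       # zero mean, unit variance
            out[i][c] = gamma[c] * x_hat + beta[c]      # learnable scale and shift
    return out

# channel 0 has values 1 and 3; channel 1 has values 10 and 30 — each is
# normalized independently, so both come out as roughly -1 and +1
print(batch_norm([[1.0, 10.0], [3.0, 30.0]], [1.0, 1.0], [0.0, 0.0]))
```

With γ = 1 and β = 0 this is pure normalization; training then moves γ_c and β_c away from (1, 0) wherever a different mean/variance helps.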

Parameters for Batch Normalization

The parameters for Batch Normalization are:

Scale Parameter (γ_c): One per channel. It's a learnable parameter that scales the normalized value.

Shift Parameter (β_c): One per channel. It's a learnable parameter that shifts the normalized value.

That's why we add 2 × output channels.
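In code form (a tiny sketch; the trainable/non-trainable split here matches what Keras reports, since it also tracks a moving mean and moving variance per channel as non-trainable weights):

```python
def bn_param_counts(channels):
    trainable = 2 * channels       # gamma (scale) and beta (shift)
    non_trainable = 2 * channels   # moving mean and moving variance
    return trainable, non_trainable

# e.g. the BN after the 18-channel depthwise conv
print(bn_param_counts(18))  # (36, 36)
```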

Kindly mark it as the 'solution' if it answers your questions, and let me know if you have any other queries.