X = BatchNormalization(axis = 3)(X, training=training)
For exercise 2 (convolution block), we use axis = 3. I read in earlier posts that this is because we are performing batch normalization on the channels.
Is it because we consider every channel a mini-batch?
No. Normally a batch (or minibatch) includes many examples from the dataset, complete with all the channels for each example.
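You would apply batch normalization along a particular axis when the magnitudes of the values vary a lot along that axis. With axis = 3, the channels axis for channels-last data, each channel is normalized with its own mean and variance, computed across all examples in the batch and all spatial positions.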
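To make the per-channel idea concrete, here is a rough NumPy sketch of what normalizing over axis 3 does at training time. It is only an illustration, not how Keras implements the layer: the function name `batchnorm_channels_last`, the `eps` value, and the toy input are assumptions of mine, and the learned scale/offset and the moving averages are left out.

```python
import numpy as np

def batchnorm_channels_last(x, eps=1e-3):
    """Normalize each channel of a (batch, height, width, channels) array.

    Hypothetical helper for illustration; omits gamma/beta and moving averages.
    """
    # Statistics are taken over every axis EXCEPT the channel axis (axis 3),
    # so each channel gets one mean and one variance.
    mean = x.mean(axis=(0, 1, 2), keepdims=True)  # shape (1, 1, 1, C)
    var = x.var(axis=(0, 1, 2), keepdims=True)    # shape (1, 1, 1, C)
    return (x - mean) / np.sqrt(var + eps)

# Toy input: 8 examples, 4x4 spatial size, 3 channels with very different scales.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 4, 3))
x[..., 1] *= 100.0   # channel 1 has a much larger spread
x[..., 2] += 50.0    # channel 2 has a large offset

y = batchnorm_channels_last(x)

# After normalization every channel has roughly zero mean and unit spread.
per_channel_mean = y.mean(axis=(0, 1, 2))
per_channel_std = y.std(axis=(0, 1, 2))
```

So the batch is still the set of examples. The axis argument only says which dimension keeps its own separate statistics.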