I am struggling to understand some aspects of Batch Normalization as discussed in the lectures.

1.) It was mentioned that the b parameter is redundant, since it will be subtracted away by the normalization. Shouldn’t one of the w elements be eliminated in the same way? Consider the weights in layer l as w=[w1,w2,w3,w4]. With batch normalization, the weights w=[10,20,30,40] are equivalent to w=[0.1,0.2,0.3,0.4], since the difference in scale is removed by the normalization. As a consequence, one of the weights in w could simply be fixed and the others defined as multiples of it, for example w1=1 and w=[w1, 2*w1, 3*w1, 4*w1].
In fact, if we don’t do this and use L2 regularization, the scale of the weights is free to be pushed towards epsilon (with no effect on the resulting predictions) until the network blows up from rounding errors.
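To see the scale invariance I am describing, here is a minimal sketch (my own toy normalization, not the course code): scaling the weights by a common factor scales z by that factor, and the normalization removes it again.

```python
import numpy as np

def batch_norm(z, eps=1e-8):
    # Normalize each feature (column) over the mini-batch (rows).
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    return (z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))          # a mini-batch of 32 examples
w = np.array([0.1, 0.2, 0.3, 0.4])    # per-feature weights

z_small = X * w          # z computed with w = [0.1, 0.2, 0.3, 0.4]
z_big   = X * (w * 100)  # z computed with w = [10, 20, 30, 40]

# The normalized outputs match up to floating-point noise:
print(np.allclose(batch_norm(z_small), batch_norm(z_big)))  # True
```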

2.) Why is it called Batch Normalization, if the idea is to normalize the layer activations a (or their arguments z)? Granted, it puts all the mini-batch activations on the same scale, but isn’t the idea equally useful even without mini-batching?

3.) If we are generous enough to make the gamma scaling of the hidden layers’ activation arguments z a learnable parameter, why not allow it for the input layer as well?

There is a lot to discuss about Batch Normalization. I don’t have complete answers for all your questions, but here are some thoughts:

I think you’re at least slightly missing the point here. The normalization is based on the actual activation values that are seen.

It happens on the linear activations of the hidden layers, so it applies with whatever style of batch processing you are doing: mini-batch or full-batch gradient descent. It wouldn’t make sense for the Stochastic GD case, in the sense that it becomes a no-op if you’re computing the mean of a single value.
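To make the batch-size-1 point concrete, here is a sketch with a toy normalization function (my own, not the course code): with a single example, the batch mean equals the value itself and the variance is zero, so the normalized output is always zero regardless of the input.

```python
import numpy as np

def batch_norm(z, eps=1e-8):
    mu = z.mean(axis=0)
    var = z.var(axis=0)          # zero when the batch has one example
    return (z - mu) / np.sqrt(var + eps)

z_single = np.array([[3.7]])     # a "mini-batch" of one value
print(batch_norm(z_single))      # [[0.]] -- the input value is erased
```

After the learnable rescaling, the layer would then output gamma * 0 + beta = beta for every input, which is why BN degenerates in the stochastic case.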

The point is that BN applies to the linear activation outputs, so it can be applied at any of the hidden layers. If you think about it, there really isn’t an “input layer”: it’s the first hidden layer, right? And BN can be applied to the activations of that layer.

If we fix an element of W (like W1 in your example), one problem I see is that there is no guarantee that the resulting components of z will be of more or less uniform size. Even z_1 alone can become large due to off-diagonal elements of W, so we may have to normalize again anyway. Another problem with fixing W1 at 1 is that the model will never converge if the true minimum requires W1 to be very small compared to the other elements, or zero.
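A small sketch of that first problem (the matrices and values here are made up for illustration): even with the first weight pinned at 1, the other entries in the same row of W can still blow up the scale of z_1, so per-feature normalization is still needed.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(32, 4))               # mini-batch of 32 inputs

W = rng.normal(size=(4, 4)) * 0.5
W[0, 0] = 1.0                              # "fix" the first weight, as proposed
W[0, 1:] = [50.0, -80.0, 60.0]             # the off-diagonal entries are still free

Z = X @ W.T                                # linear activations z = W x
print(Z.std(axis=0))                       # z_1 is far larger than the rest
```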

As for the regularization term dominating or being negligible compared to the original cost term, I think its relative effect is controlled by \lambda. If W has too many big elements, \lambda has to be small so that the regularization term is meaningfully large without completely dominating the first term.

I also agree with you that some of the scale of W must have gone into \gamma. I have only just started learning machine/deep learning myself, but I don’t think people worry too much about having a few extra parameters if the model performs better.
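To illustrate that last point with a sketch of the full BN transform (toy code, not the course implementation): after normalization, the overall scale of W is irrelevant and the output scale is set entirely by the learnable \gamma (and the shift by \beta).

```python
import numpy as np

def bn_layer(z, gamma, beta, eps=1e-8):
    # Full BN transform: normalize, then rescale with learnable gamma, beta.
    z_hat = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)
    return gamma * z_hat + beta

rng = np.random.default_rng(2)
z = rng.normal(size=(32, 3))                       # linear activations of a layer
gamma = np.array([2.0, 0.5, 1.5])
beta  = np.array([0.0, 1.0, -1.0])

# Scaling z (i.e. scaling W) changes nothing; gamma alone sets the output scale.
print(np.allclose(bn_layer(z, gamma, beta), bn_layer(7.0 * z, gamma, beta)))  # True
```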