Questions about regularization

Keeping the absolute values of the weights at all layers “suppressed” is a good thing: even if we use ReLU in the hidden layers, the output layer is still sigmoid (or softmax in the multiclass case), which means we still have to worry about the “flat tails” of that function. When the absolute values of the Z values at the output layer get too large, the activation saturates and the gradients approach zero, so learning stalls. And since the output-layer Z is computed from the activations of all the earlier layers, the weight values at every layer contribute to that.
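
To make the “flat tails” point concrete, here is a minimal sketch in plain Python (not tied to anything in the course code) that evaluates the sigmoid and its derivative at increasingly large z values:

```python
import math

def sigmoid(z):
    """Standard logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# The derivative of sigmoid is sigmoid(z) * (1 - sigmoid(z)).
# It peaks at 0.25 when z = 0 and collapses toward zero on the tails.
for z in [0.0, 2.0, 5.0, 10.0, 20.0]:
    s = sigmoid(z)
    grad = s * (1.0 - s)
    print(f"z = {z:5.1f}   sigmoid(z) = {s:.9f}   sigmoid'(z) = {grad:.2e}")
```

By z = 10 the derivative is already down around 4.5e-05, and by z = 20 it is about 2e-09, so any gradient flowing back through that output neuron is essentially multiplied by zero. That is why regularization, by keeping the absolute values of the weights (and therefore of Z) smaller, also helps keep the gradients alive.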

In terms of going deeper into this mathematically, I have not personally tried to do that, so I don’t have any direct references I can give. Here’s a general bibliography thread about ML/DL textbooks. I’ve heard that the Goodfellow, Bengio, and Courville book (Deep Learning) is more mathematical. I just checked the ToC, and it definitely has a chapter on Regularization.