I’ve been trying to derive the gradients for batch normalisation since I want to implement it on my own. I’ve successfully found the gradients with respect to beta and gamma, but I can’t figure out the gradients for the weights, the biases (I’m not removing them just yet), or dA for the previous layer (the layer preceding this one, which will use it for its own backprop step). The main problem I’m facing is working out the derivatives of the mean and the standard deviation with respect to Z (the linear combination of the weights and bias).
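To make the notation concrete, I’m using the standard mini-batch setup, with m examples in the batch and epsilon for numerical stability:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} Z^{(i)}, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m}\left(Z^{(i)} - \mu\right)^2, \qquad Z^{(i)}_{\text{norm}} = \frac{Z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad \tilde{Z}^{(i)} = \gamma\, Z^{(i)}_{\text{norm}} + \beta$$

so what I’m stuck on is how $\partial\mu/\partial Z^{(i)}$ and $\partial\sigma/\partial Z^{(i)}$ enter into $\partial L/\partial Z^{(i)}$.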
Thanks! I was able to figure it out using this very blog post. Interestingly, only the computation for dZ changes in the entire process. dW and db are still calculated in the same way. Pretty amazing how these things work out in the end.
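In case it helps anyone else, here is a minimal numpy sketch of that modified step, assuming the forward pass cached Z_norm, gamma and the inverse standard deviation (the function and variable names are mine, not from the course notebooks):

```python
import numpy as np

def batchnorm_backward(dZ_tilde, cache):
    """Backprop through batch norm applied to Z (before the activation).

    dZ_tilde : gradient of the loss w.r.t. gamma * Z_norm + beta, shape (n_units, m)
    cache    : (Z_norm, gamma, inv_std) saved in the forward pass, where
               Z_norm = (Z - mu) * inv_std and inv_std = 1 / sqrt(var + eps)
    """
    Z_norm, gamma, inv_std = cache
    m = dZ_tilde.shape[1]

    # Parameter gradients (the part worked out in the post)
    dgamma = np.sum(dZ_tilde * Z_norm, axis=1, keepdims=True)
    dbeta = np.sum(dZ_tilde, axis=1, keepdims=True)

    # Gradient w.r.t. the normalised values
    dZ_norm = dZ_tilde * gamma

    # The only step that changes: the chain rule through Z_norm, the mean
    # and the variance, collapsed into one compact expression.
    dZ = (inv_std / m) * (
        m * dZ_norm
        - np.sum(dZ_norm, axis=1, keepdims=True)
        - Z_norm * np.sum(dZ_norm * Z_norm, axis=1, keepdims=True)
    )

    # dW and db are then computed from dZ exactly as before,
    # e.g. dW = (1/m) * dZ @ A_prev.T
    return dZ, dgamma, dbeta
```

The three terms inside the parentheses are the chain-rule contributions through Z_norm itself, through the mean, and through the variance, which is why nothing else in the backward pass needs to change.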
I’m trying to do something similar and add batch normalization to the L-layer neural network that we created in the first course. I would appreciate it if you could share how you changed the dZ computation.