Hello. I'm trying to implement batch normalization in my neural network model, but I'm having trouble understanding the chain rule for the gradients of gamma and beta. Let's say that in the output layer we have x; xhat = (x - mean(x))/std(x); z = xhat*gamma + beta; a = f(z) (f is the activation function).
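To make the setup concrete, here is a minimal sketch of the forward pass for a single feature over a batch of m samples. It assumes z is meant to be xhat*gamma + beta, that std(x) is the biased batch standard deviation with a small `eps` added for stability, and that f is a sigmoid. The function names, `eps`, and the sample values are only illustrative.

```python
import math

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch norm for one feature; x is a list of m values from the batch."""
    m = len(x)
    mean = sum(x) / m
    var = sum((xi - mean) ** 2 for xi in x) / m   # biased variance (assumption)
    std = math.sqrt(var + eps)                    # eps is an assumed stabilizer
    xhat = [(xi - mean) / std for xi in x]        # normalized input
    z = [gamma * xh + beta for xh in xhat]        # scale and shift
    return z, xhat, std

def sigmoid(v):
    # Assumed activation f; any differentiable f would work the same way.
    return 1.0 / (1.0 + math.exp(-v))

x = [1.0, 2.0, 3.0, 4.0]                          # illustrative batch
z, xhat, std = batchnorm_forward(x, gamma=2.0, beta=0.5)
a = [sigmoid(zi) for zi in z]
```

After normalization, xhat has zero mean over the batch, so the mean of z equals beta.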

In the chain rule, we first have to compute dloss/dz. First question: is dloss/dz = f'(z) * (a - y)/m, where f'(z) = df/dz?
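One way to test a formula like this is a finite-difference check. The sketch below assumes a squared-error loss L = 1/(2m) * sum((a - y)^2) and a sigmoid activation, neither of which is stated above. With a different loss (for example cross-entropy) the expression for dloss/dz would differ. All names and values are illustrative.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def loss(z, y):
    # Assumed loss: squared error with a 1/(2m) factor.
    m = len(z)
    return sum((sigmoid(zi) - yi) ** 2 for zi, yi in zip(z, y)) / (2 * m)

def dloss_dz(z, y):
    # Candidate formula: dL/dz_i = f'(z_i) * (a_i - y_i) / m,
    # with f'(z) = a * (1 - a) for the sigmoid.
    m = len(z)
    grads = []
    for zi, yi in zip(z, y):
        a = sigmoid(zi)
        grads.append(a * (1.0 - a) * (a - yi) / m)
    return grads

def numeric_grad(z, y, h=1e-6):
    # Central finite differences, one coordinate at a time.
    grads = []
    for i in range(len(z)):
        zp = list(z); zp[i] += h
        zm = list(z); zm[i] -= h
        grads.append((loss(zp, y) - loss(zm, y)) / (2 * h))
    return grads

z = [0.3, -1.2, 2.0]
y = [1.0, 0.0, 1.0]
analytic = dloss_dz(z, y)
numeric = numeric_grad(z, y)
max_diff = max(abs(p - q) for p, q in zip(analytic, numeric))
```

Under these assumptions, `max_diff` should be tiny, which would suggest the formula matches this particular loss.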

Second question: the final step gives dloss/dx. If we then move to the previous layer to apply the chain rule again, is this computed dloss/dx equal to dloss/dz in the previous layer?
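For reference, here is a sketch of the backward pass through the batch norm step for one feature, checked against finite differences. It assumes dloss/dz is already known (a toy loss L = sum(w_i * z_i) stands in, so dloss/dz = w), biased variance, and an `eps` term. Note that the resulting dx is the gradient with respect to this layer's x. If x comes from something like W*a_prev + b (an assumption, not stated above), reaching the previous layer's z would still involve W and the previous activation's derivative.

```python
import math

def forward(x, gamma, beta, eps=1e-5):
    m = len(x)
    mean = sum(x) / m
    var = sum((xi - mean) ** 2 for xi in x) / m
    std = math.sqrt(var + eps)
    xhat = [(xi - mean) / std for xi in x]
    z = [gamma * xh + beta for xh in xhat]
    return z, xhat, std

def backward(dz, xhat, std, gamma):
    """Given dL/dz, return dL/dx, dL/dgamma, dL/dbeta for one feature."""
    m = len(dz)
    dgamma = sum(d * xh for d, xh in zip(dz, xhat))   # z = gamma*xhat + beta
    dbeta = sum(dz)
    dxhat = [d * gamma for d in dz]
    s1 = sum(dxhat)
    s2 = sum(d * xh for d, xh in zip(dxhat, xhat))
    # mean and std both depend on every x_i, hence the two correction terms.
    dx = [(m * d - s1 - xh * s2) / (m * std) for d, xh in zip(dxhat, xhat)]
    return dx, dgamma, dbeta

w = [0.5, -1.0, 2.0, 0.25]        # toy upstream gradient dL/dz (hypothetical)

def toy_loss(x, gamma, beta):
    z, _, _ = forward(x, gamma, beta)
    return sum(wi * zi for wi, zi in zip(w, z))

x = [1.0, 2.0, 4.0, 7.0]
gamma, beta = 1.5, 0.2
z, xhat, std = forward(x, gamma, beta)
dx, dgamma, dbeta = backward(w, xhat, std, gamma)

# Finite-difference estimates for comparison.
h = 1e-6
num_dgamma = (toy_loss(x, gamma + h, beta) - toy_loss(x, gamma - h, beta)) / (2 * h)
num_dbeta = (toy_loss(x, gamma, beta + h) - toy_loss(x, gamma, beta - h)) / (2 * h)
num_dx = []
for i in range(len(x)):
    xp = list(x); xp[i] += h
    xm = list(x); xm[i] -= h
    num_dx.append((toy_loss(xp, gamma, beta) - toy_loss(xm, gamma, beta)) / (2 * h))
```

Comparing `dx`, `dgamma`, and `dbeta` with their numerical counterparts is a quick way to catch a sign or indexing mistake in the chain rule.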

Thank you very much