Excuse me! According to this initialization, w[l] = np.random.randn(shape) * np.sqrt(2 / n[l-1]), how does this help with vanishing or exploding gradients? It wasn't clear enough to me. Also, does changing the square-root formula to match the activation function affect the final output that much?
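To make the first question concrete, here is a minimal sketch of how I understand the initialization. It assumes ReLU layers and made-up layer sizes; the names init_weights, forward_std and layer_dims are my own and not from the course. It pushes random inputs through a stack of layers and compares the spread of the final activations with and without the np.sqrt(2 / n[l-1]) factor.

```python
import numpy as np

def init_weights(layer_dims, scaled=True, seed=0):
    """Draw one weight matrix per layer; optionally apply the sqrt(2 / n[l-1]) factor."""
    rng = np.random.default_rng(seed)
    weights = []
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        w = rng.standard_normal((n_curr, n_prev))
        if scaled:
            w *= np.sqrt(2.0 / n_prev)  # the factor from the question
        weights.append(w)
    return weights

def forward_std(weights, x):
    """Run a bias-free ReLU forward pass and return the std of the last activations."""
    a = x
    for w in weights:
        a = np.maximum(0.0, w @ a)
    return a.std()

# Hypothetical setup: 10 layers of width 256, 512 random input examples.
layer_dims = [256] * 11
x = np.random.default_rng(1).standard_normal((256, 512))

std_scaled = forward_std(init_weights(layer_dims, scaled=True), x)
std_unscaled = forward_std(init_weights(layer_dims, scaled=False), x)
print("with scaling:   ", std_scaled)
print("without scaling:", std_unscaled)
```

If I understand correctly, the scaled version should keep the activations at roughly the same magnitude from layer to layer, while the unscaled one grows by a large factor at every layer.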
Another point: what is epsilon in the numerical approximation of gradients? How does this even help with the issue of vanishing/exploding gradients and with the correct implementation of backpropagation?
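For the second question, this is a minimal sketch of the two-sided approximation as I understand it, where epsilon is the small step in (J(theta + epsilon) - J(theta - epsilon)) / (2 * epsilon). The cost function J, its analytic gradient, and the helper names numerical_grad and relative_difference are toy assumptions of mine, not the course code.

```python
import numpy as np

def numerical_grad(J, theta, epsilon=1e-7):
    """Approximate dJ/dtheta by nudging one parameter at a time by +/- epsilon."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = epsilon
        grad[i] = (J(theta + step) - J(theta - step)) / (2 * epsilon)
    return grad

def relative_difference(g1, g2):
    """Normalized distance between two gradient vectors."""
    return np.linalg.norm(g1 - g2) / (np.linalg.norm(g1) + np.linalg.norm(g2))

# Toy cost J(theta) = sum(theta^3), whose exact gradient is 3 * theta^2.
J = lambda theta: np.sum(theta ** 3)
analytic_grad = lambda theta: 3 * theta ** 2
buggy_grad = lambda theta: 2 * theta ** 2  # deliberately wrong, for comparison

theta = np.array([1.0, -2.0, 0.5])
approx = numerical_grad(J, theta)
diff_correct = relative_difference(approx, analytic_grad(theta))
diff_buggy = relative_difference(approx, buggy_grad(theta))
print("correct gradient:", diff_correct)
print("buggy gradient:  ", diff_buggy)
```

Here the difference should be tiny for the correct gradient and clearly larger for the deliberately wrong one.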
Thanks in advance