Hi Everyone,
In the last lecture of week 3, named Random Initialization, it was mentioned that the weight matrix W of a layer is usually initialized as np.random.randn(5,10) * 0.01 (assuming 5 neurons and 10 inputs per neuron). It was taught that the reason we multiply by a small number like 0.01 is that, if we multiply by a large number like 100 instead, the pre-activations become large, the slope of the sigmoid curve at those points is close to 0, and gradient descent will therefore be slow.
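To make the claim concrete, here is a small sketch (my own illustration, not from the lecture) comparing the two scales. It uses the same shapes as the example, a hypothetical input vector x, and the sigmoid derivative s(z)·(1 − s(z)) as the "slope":

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_slope(z):
    # derivative of sigmoid: s(z) * (1 - s(z)), maximal (0.25) at z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

np.random.seed(0)
x = np.random.randn(10, 1)  # hypothetical input: 10 features, 1 example

for scale in (0.01, 100):
    W = np.random.randn(5, 10) * scale  # 5 neurons, 10 weights each
    z = W @ x                           # pre-activations, shape (5, 1)
    print(f"scale={scale}: mean |z| = {np.abs(z).mean():.4f}, "
          f"mean slope = {sigmoid_slope(z).mean():.6f}")
```

With scale 0.01 the pre-activations sit near 0, where the sigmoid's slope is near its maximum of 0.25; with scale 100 they land far out on the flat tails, where the slope is essentially 0.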
But even when we use a smaller value like 0.01 (or any value), wouldn't gradient descent still slow down whenever the slope approaches 0? How can using a value like 100 specifically be problematic?
Thanks in advance