np.random.randn(5,10) * 0.01

Hi Everyone,

In the last lecture of Week 3, named Random Initialization, it was mentioned that the weight parameters w of a layer are usually initialized as np.random.randn(5,10) * 0.01 (assuming 5 neurons and 10 weights per neuron). It was taught that the reason we multiply by a smaller number like 0.01 is that, if we multiplied by a large number like 100 instead, the slope of the sigmoid curve would be close to 0 and gradient descent would be slow.
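For concreteness, here is a minimal sketch of the two initializations being compared (the layer sizes 5 and 10 are just the lecture's example; the names W_small, W_large, and the toy input x are mine, not from the course code):

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(10, 1)                 # one example with 10 input features

W_small = np.random.randn(5, 10) * 0.01    # the initialization recommended in the lecture
W_large = np.random.randn(5, 10) * 100     # the "large" alternative from the question
b = np.zeros((5, 1))

z_small = W_small @ x + b
z_large = W_large @ x + b

print(np.abs(z_small).max())   # roughly 0.01-0.1: sigmoid stays in its steep region
print(np.abs(z_large).max())   # in the hundreds: sigmoid is saturated, slope ~ 0
```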

But even when we use a smaller value like 0.01 (or any other value), wouldn't gradient descent slow down as the slope approaches 0? How can just using a value like 100 be problematic?

Thanks in advance

The slope of the sigmoid is only near zero at very large positive and negative values.

Around the origin, the slope is at its maximum.

This helps the gradient descent process work more quickly.
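To see this numerically, here is a quick sketch (the helper names are my own) that evaluates the sigmoid's derivative, sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)), at a few values of z:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_slope(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # derivative of the sigmoid

for z in [0.0, 1.0, 5.0, 100.0]:
    print(z, sigmoid_slope(z))
# 0.0   -> 0.25    (maximum slope, at the origin)
# 1.0   -> ~0.197
# 5.0   -> ~0.0066
# 100.0 -> ~0.0    (saturated: essentially no slope)
```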

How exactly does having a z value (wx + b) near 0, where the sigmoid curve's slope is high, help gradient descent, which moves the weights forward or backward by alpha * dJ/dw, run faster?

The magnitude of the slope (i.e. the gradient) is highest around the origin. Because the update is alpha * dJ/dw, and dJ/dw contains the activation's derivative as a factor via the chain rule, a higher slope produces larger changes in the weights during training; a near-zero slope produces almost no change.
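A rough sketch of how that plays out for a single hidden unit (the variable names and the placeholder upstream_grad value are illustrative, not from the course code): backprop multiplies the upstream gradient by g'(z), so a saturated unit gets an almost-zero update:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha = 0.1
upstream_grad = 0.5        # whatever dJ/da the later layers pass back (made-up value)
x_input = 1.0              # the input feeding this particular weight

for z in [0.1, 100.0]:     # small-weight init vs. large-weight init
    a = sigmoid(z)
    local_slope = a * (1.0 - a)               # g'(z), from the chain rule
    dJ_dw = upstream_grad * local_slope * x_input
    print(z, alpha * dJ_dw)                   # size of the weight update
# z = 0.1   -> update ~ 0.012  (learning proceeds)
# z = 100.0 -> update ~ 0.0    (gradient vanishes, learning stalls)
```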