Course 1 Week 3

Why are the random values we use to initialise the W1 and W2 matrices divided by 100? Why can’t we just use the generated values as they are?

Smaller initial values tend to help gradient descent converge. With large weights, the inputs to the activation function are large in magnitude, which pushes sigmoid (or tanh) onto the flat part of its curve, where the gradients are close to zero and learning is slow. Large values can also produce NaN values for the cost, because of “saturation” of the sigmoid function. Mathematically, the output of sigmoid is never exactly 0 or 1, but in floating point it can round to exactly 0 or 1. If the output of sigmoid is exactly 1, the cost function ends up evaluating log(1 - 1) = log(0), which gives infinite or NaN values for the cost.
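
As a rough illustration of the saturation problem, here is a minimal NumPy sketch. The helper names (sigmoid, sigmoid_grad, cross_entropy) and the example values 40.0 and 0.4 are assumptions chosen for this demo, not code from the course notebooks. The two inputs differ by a factor of 100, which mirrors the effect of dividing the weights by 100.

```python
import numpy as np

def sigmoid(z):
    # Standard logistic function.
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of sigmoid with respect to z: a * (1 - a).
    a = sigmoid(z)
    return a * (1.0 - a)

def cross_entropy(a, y):
    # Per-example logistic loss. NumPy warnings are silenced so the
    # inf/NaN results can be inspected without noise.
    with np.errstate(divide="ignore", invalid="ignore"):
        return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z_large = 40.0            # hypothetical pre-activation with unscaled weights
z_small = z_large / 100   # the same pre-activation with weights divided by 100

# In float64, exp(-40) is too small to change 1.0, so sigmoid rounds to exactly 1.
print(sigmoid(z_large) == 1.0)
print(sigmoid_grad(z_large))               # gradient vanishes: nothing to learn from
print(cross_entropy(sigmoid(z_large), 1.0))  # 0 * log(0) makes the loss NaN

# With the smaller input, everything stays finite and the gradient is usable.
print(sigmoid(z_small), sigmoid_grad(z_small))
print(cross_entropy(sigmoid(z_small), 1.0))
```

Note that in this sketch the loss is NaN even when the label is 1 and the prediction is "right", because the (1 - y) * log(1 - a) term becomes 0 times negative infinity. With a label of 0 the same saturated output would give an infinite loss instead.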