Observations of Random Initialization (Assignment)

Hi, I am having trouble understanding some of these observations for random initialization.

Can you explain what these mean in a simpler way?

  • The cost starts very high. This is because with large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and when it gets that example wrong it incurs a very high loss for that example. Indeed, when log(a[3]) = log(0), the loss goes to infinity.
    Why is it that the last activation (sigmoid) outputs results that are very close to 0 or 1 for large random-valued weights? (The log(0) part I can follow; see my quick check below.)
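Here is the quick check I mean (my own snippet, not from the assignment; a just stands in for the sigmoid output a[3] when the true label y = 1):

```python
import numpy as np

# Cross-entropy loss for a single example with label y = 1 is -log(a).
# As the predicted probability a approaches 0, the loss grows without bound.
for a in [0.5, 0.1, 1e-3, 1e-6, 1e-12]:
    loss = -np.log(a)
    print(f"a = {a:<8g} loss = {loss:.2f}")
```

So I can see why a wrong, very confident prediction makes the cost huge. What I don't understand is the step before that.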

Thank you.

Hi, @shamus.

The input to the last sigmoid activation is calculated as z3 = np.dot(W3, a2) + b3. What values does a3 = sigmoid(z3) take when z3 is large (either positive or negative)?

[image: graph of the sigmoid function]
(source)
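You can also check this numerically. A minimal sketch (the shapes, weight scales, and the 1000 made-up examples are just for illustration, not the assignment's actual values):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# 1) How the sigmoid behaves as its input grows in magnitude:
for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"z = {z:6.1f}  sigmoid(z) = {sigmoid(z):.6f}")

# 2) With large random weights, z3 = np.dot(W3, a2) + b3 is a sum of many
#    large terms, so |z3| itself tends to be large. Compare the typical
#    magnitude of z3 under a small vs. a large weight scale:
rng = np.random.default_rng(0)
a2 = rng.random((4, 1000))               # hypothetical activations for 1000 examples
for scale in [0.01, 10.0]:
    W3 = rng.standard_normal((1, 4)) * scale
    z3 = np.dot(W3, a2)                  # bias b3 omitted for simplicity
    print(f"scale = {scale:5.2f}  mean |z3| = {np.abs(z3).mean():.3f}")
```

With the scaled-up weights, |z3| lands far out on the flat tails of the curve above, so sigmoid(z3) gets pushed very close to 0 or 1. Whenever such a confident prediction is wrong, the -log(a[3]) term in the cost becomes huge, which is exactly why the cost starts so high.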

Let me know if that was helpful! 🙂