Hi, I am having trouble understanding some of these observations for random initialization.
Can you explain what these mean in a simpler way?
- The cost starts very high. This is because with large random-valued weights, the last activation (sigmoid) outputs results that are very close to 0 or 1 for some examples, and when it gets that example wrong it incurs a very high loss for that example. Indeed, when log(a^[3]) = log(0), the loss goes to infinity.
Why is it that the last activation (sigmoid) outputs results that are very close to 0 or 1 for large random-valued weights?
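To show what I mean, here is a small NumPy sketch with made-up numbers (a single unit, not the actual course network). Scaling the same weights by 1000 makes the pre-activation z = w·x large in magnitude, which pushes sigmoid(z) toward 0 or 1, and then the cross-entropy loss blows up if the label disagrees:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy input and weights for one unit (illustrative values only)
x = np.array([1.0, -0.5, 2.0, 0.3])
w_small = np.array([0.01, 0.02, -0.01, 0.03])  # small init, e.g. randn * 0.01
w_large = w_small * 1000                       # same directions, large magnitudes

z_small = w_small @ x   # small |z|  -> sigmoid near 0.5
z_large = w_large @ x   # large |z|  -> sigmoid saturates near 0 or 1

a_small = sigmoid(z_small)   # ~ 0.497
a_large = sigmoid(z_large)   # ~ 1.7e-05, i.e. essentially 0

# Cross-entropy loss when the true label is 1: -log(a) explodes as a -> 0
y = 1.0
loss_large = -y * np.log(a_large) - (1 - y) * np.log(1 - a_large)
print(a_small, a_large, loss_large)
```

So the saturation itself comes from |z| being large, not directly from the weights; large weights just make large |z| very likely.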
Thank you.
