The cost starts out very high. With large random-valued weights, the last activation (sigmoid) outputs values that are very close to 0 or 1 for some examples, and when the prediction for such an example is wrong, that example incurs a very large loss. Indeed, when

log(a^[3]) = log(0)

, the loss goes to infinity.
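A quick sketch of this effect (not code from the assignment itself): with a large negative z, the sigmoid output a underflows to exactly 0, and if the true label is y = 1, the cross-entropy term -log(a) becomes infinite.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = 1  # true label

# Moderately large |z|: a is tiny but nonzero, loss is large but finite.
a_moderate = sigmoid(-50.0)          # about 1.9e-22
loss_moderate = -(y * np.log(a_moderate) + (1 - y) * np.log(1 - a_moderate))
print(loss_moderate)                 # about 50, already a huge loss

# Extremely large |z|: exp(-z) overflows, so a underflows to exactly 0.0,
# and -log(0) is infinite -- the loss blows up.
with np.errstate(over="ignore", divide="ignore"):
    a_extreme = sigmoid(-800.0)      # exactly 0.0 in float64
    loss_extreme = -(y * np.log(a_extreme) + (1 - y) * np.log(1 - a_extreme))
print(a_extreme, loss_extreme)       # 0.0 inf
```

In practice implementations clip a away from 0 and 1 (or compute the loss from z directly) precisely to avoid this infinity.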

This is the line mentioned in the observations of the programming assignment. If w is large, then z will be large, so a will be close to 1. How can a be 0? Please explain.