Hello,
In the Initialization Assignment, it is mentioned: "As you can see with the prediction being 0.5 whether the actual (y) value is 1 or 0 you get the same loss value for both, so none of the weights get adjusted and you are stuck with the same old value of the weights."
I am unable to understand how the first part of that statement ("with the prediction being 0.5 whether the actual (y) value is 1 or 0 you get the same loss value for both") implies the second part ("so none of the weights get adjusted and you are stuck with the same old value of the weights"). Please help me understand the reasoning behind this conclusion.
More generally: if, for a binary classification problem (X, y), we get the same non-zero loss value for training examples of both categories of y (0 and 1), does this always mean that the weights will stop getting updated during gradient descent (even if the gradients of the cost function are non-zero)?
The real point is that if you initialize with all zeros, then the gradients are zero. That is why no learning can take place. Here’s a thread which goes through the math behind that.
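Roughly, the math goes like this (just a sketch, assuming the assignment's network is a feed-forward net with ReLU hidden layers and a sigmoid output, which is how I read it). With every parameter at zero, $Z^{[1]} = W^{[1]}X + b^{[1]} = 0$ and $A^{[1]} = \mathrm{ReLU}(0) = 0$, so every hidden activation is zero and the output is $A^{[L]} = \sigma(0) = 0.5$. In backprop,

$$dW^{[L]} = \tfrac{1}{m}\, dZ^{[L]} A^{[L-1]T} = 0 \ \ (\text{since } A^{[L-1]} = 0), \qquad dZ^{[l]} = W^{[l+1]T} dZ^{[l+1]} \ast g'(Z^{[l]}) = 0 \ \ (\text{since } W^{[l+1]} = 0),$$

so every weight gradient vanishes and the weights never move. The only parameter that can receive a non-zero gradient is the output-layer bias, with $db^{[L]} = \tfrac{1}{m}\sum_i (0.5 - y^{(i)})$, which is where the class balance of the dataset comes in.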
The answer to your last question is generally no - having the same non-zero loss value for training examples of both categories does not imply zero gradients. The statement you cite from the assignment happens to hold only because the specific dataset used in the assignment has exactly the same number of positive and negative examples, so the gradients from the positive and negative examples cancel each other. If you remove just one example from the dataset, the gradients are no longer zero.
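Here is a minimal numpy sketch of that point (the 2-layer ReLU-then-sigmoid architecture, the hidden size, and the toy data are my own choices, not the assignment's exact network or dataset). With all-zero parameters the weight gradients come out exactly zero regardless of the data, and the output-bias gradient is zero only because the labels are balanced; drop one example and it is no longer zero:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def zero_init_gradients(X, Y):
    """One forward + backward pass for a 2-layer net (ReLU -> sigmoid)
    with all parameters initialized to zero."""
    n_x, m = X.shape
    n_h = 4                                        # hypothetical hidden size
    W1 = np.zeros((n_h, n_x)); b1 = np.zeros((n_h, 1))
    W2 = np.zeros((1, n_h));   b2 = np.zeros((1, 1))

    # Forward: every hidden activation is ReLU(0) = 0, output is sigmoid(0) = 0.5
    Z1 = W1 @ X + b1; A1 = relu(Z1)
    Z2 = W2 @ A1 + b2; A2 = sigmoid(Z2)

    # Backward (binary cross-entropy with sigmoid output)
    dZ2 = A2 - Y                                   # = 0.5 - y for every example
    dW2 = dZ2 @ A1.T / m                           # zero, because A1 is all zeros
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m   # = mean(0.5 - y)
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)                  # zero, because W2 is all zeros
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

# Made-up data: 2 features, 4 examples, perfectly balanced labels
X = np.array([[1.0, -2.0, 0.5, 3.0],
              [0.3,  1.2, -1.0, 2.0]])
Y = np.array([[1, 1, 0, 0]])

for name, g in zip(["dW1", "db1", "dW2", "db2"], zero_init_gradients(X, Y)):
    print(name, g.ravel())
# dW1, db1, dW2 are exactly zero no matter what the data is;
# db2 = mean(0.5 - y) is zero only because the labels are balanced.

# Remove one example: db2 is no longer zero, so that parameter would start moving.
print("db2 (one example removed):",
      zero_init_gradients(X[:, :3], Y[:, :3])[3].ravel())
```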
The main issue with zero initialization is therefore not the zero gradients, but the symmetry between different neurons in the same layer: neurons that start identical receive identical gradients, so they stay identical and the whole layer behaves like a single neuron.
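To see the symmetry problem concretely, here is a rough sketch (the constant initialization value, the tanh hidden layer, and the made-up data are all hypothetical, not from the assignment). Every hidden unit starts identical, receives identical gradients at every step, and therefore never differentiates from its siblings, no matter how long you train:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical setup: 2 features, 3 hidden units (tanh), sigmoid output,
# every weight initialized to the same nonzero constant.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 50))
Y = (X[0:1] * X[1:2] > 0).astype(float)   # some made-up labels
m = X.shape[1]

c = 0.5
W1 = np.full((3, 2), c); b1 = np.zeros((3, 1))
W2 = np.full((1, 3), c); b2 = np.zeros((1, 1))
lr = 0.1

for step in range(100):
    # Forward pass
    Z1 = W1 @ X + b1; A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2; A2 = sigmoid(Z2)
    # Backward pass (cross-entropy loss)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m; db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    dW1 = dZ1 @ X.T / m;  db1 = dZ1.sum(axis=1, keepdims=True) / m
    # Gradient-descent update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(W1)   # all three rows are still identical after 100 steps:
print(W2)   # the hidden units never learn different features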