I do not understand why gradient descent does not work when the weight matrix is initialized with zeros. I understand from the first assignment that the layer's output is z = 0, which is then passed to the last (sigmoid) layer, so the activation is sigmoid(0), which is equal to 0.5.
And I also understand that the loss function will then output the same value regardless of the training example's true label.
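Here is a minimal numpy sketch of what I mean (the data X and Y are made up; w and b follow the assignment's zero initialization):

```python
import numpy as np

# Made-up data: 2 features, 3 examples
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -0.3, 2.0]])   # shape (n_x, m)
Y = np.array([[1, 0, 1]])          # shape (1, m), the true labels

w = np.zeros((X.shape[0], 1))      # zero-initialized weights
b = 0.0

Z = np.dot(w.T, X) + b             # all zeros
A = 1 / (1 + np.exp(-Z))           # sigmoid(0) = 0.5 everywhere

# Cross-entropy loss per example: -log(0.5) ~= 0.693 whatever the label is
loss = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
print(A)      # [[0.5 0.5 0.5]]
print(loss)   # [[0.6931 0.6931 0.6931]]
```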
What I do not understand is this: the general form of the gradient descent update is W = W - learning_rate * dw, and in the case of the sigmoid output, dw = 1/m * np.dot(X, (A - Y).T).
The vector A should then contain the value 0.5 everywhere, and Y contains the labels (ones and zeros), so in general A - Y, and therefore dw, should not be all zeros.
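Computing dw with the assignment's formula on the same made-up data seems to confirm this:

```python
import numpy as np

# Same made-up data as above; A is fixed at 0.5 after zero initialization
X = np.array([[1.0, 2.0, -1.0],
              [0.5, -0.3, 2.0]])
Y = np.array([[1, 0, 1]])
A = np.full((1, 3), 0.5)

m = X.shape[1]
dw = (1 / m) * np.dot(X, (A - Y).T)   # entries of (A - Y) are +/-0.5
print(dw)   # [[ 0.3333...] [-0.4666...]] -- clearly not all zeros
```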
What am I missing?