In the logistic regression exercise with recognizing cats, there are 209 training examples, and each image unrolls into a 12288-dimensional vector. If you turn each vector into a 12289-dimensional vector \tilde{x}^{(i)} by adding the bias unit +1 as, say, the 0th element of the vector, then the gradient of the cost function turns out to be a linear combination of the 209 vectors:

\nabla J = \frac{1}{m} \sum_{i=1}^{209} c_i \, \tilde{x}^{(i)}, \qquad c_i = \sigma\left(w^T x^{(i)} + b\right) - y^{(i)}
The only way this gradient can equal zero for finite values of the parameters w and b is if these vectors are linearly dependent. But that can only be guaranteed when the number of training examples exceeds the dimension of each vector, i.e., we would need more than 12289 training examples, yet we only have 209. So how can we know that a minimizer of the cost even exists? Assuming the vectors are linearly independent (which they most likely are), the only way to minimize the cost is to make each c_i = 0, which would mean the parameters w_j, b would all have to be either positive or negative infinity.
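For intuition, here is a minimal NumPy sketch of that claim, using made-up small dimensions in place of 12288 and 209 (all variable names are just illustrative). It checks that the gradient of the cross-entropy cost really is a linear combination of the bias-augmented training vectors, with coefficients c_i = a^{(i)} - y^{(i)}:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, m = 20, 5                        # small stand-ins for 12288 features, 209 examples
X = rng.standard_normal((n_x, m))     # each column is one (synthetic) training example
Y = rng.integers(0, 2, size=(1, m)).astype(float)
X_tilde = np.vstack([np.ones((1, m)), X])    # prepend the bias unit +1 to every example
w_tilde = rng.standard_normal((n_x + 1, 1))  # parameters, with b stacked on top of w

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

A = sigmoid(w_tilde.T @ X_tilde)      # predictions a^(i), shape (1, m)
c = (A - Y).ravel()                   # coefficients c_i = a^(i) - y^(i)

grad_vectorized = X_tilde @ (A - Y).T / m                          # usual vectorized gradient
grad_lincomb = sum(c[i] * X_tilde[:, [i]] for i in range(m)) / m   # explicit linear combination
print(np.allclose(grad_vectorized, grad_lincomb))                  # True
```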
Hi, @SonaA. We do not need 12,288 (or 12,289) training examples. That is the size of the input vector, i.e., the number of pixel values in an image; there are 209 example images. In terms of statistics, there are 12,288 right-hand-side variables, a large regression, plus the constant, or bias unit, for which the data point is 1. You can think of the (nonlinear) logistic regression as an ordinary linear regression problem via the following transformation:
y = \frac {1}{1 + e^{-\left(w^T X + b\right)}} \Leftrightarrow \ln \left(\frac{y}{1-y}\right) = w^T X + b
In the expression on the right, y is interpreted as a probability with y \in \left(0,1\right), whereas in the logistic regression setting on the left the observed labels satisfy y \in \{0,1\}. That is one of the reasons that MSE loss cannot be used in the logistic regression case; instead we use the binary cross-entropy loss, derived from maximum likelihood.
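As a quick sanity check on that equivalence, here is a tiny NumPy sketch (the function names are my own) confirming that the sigmoid and the log-odds transform \ln\left(\frac{y}{1-y}\right) are inverses of each other:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(y):
    # defined only for y strictly inside (0, 1)
    return np.log(y / (1.0 - y))

z = np.linspace(-4.0, 4.0, 9)         # arbitrary values standing in for w^T x + b
y = sigmoid(z)                        # probabilities in (0, 1)
print(np.allclose(log_odds(y), z))    # True: the log-odds recovers w^T x + b
```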
But the point here is one of identification. The feature matrix X has dimension \left(n_x, m\right) = \left(12288, 209\right) and y is \left(1, 209\right). So if we applied least squares to an MSE loss function for the estimation problem on the right, we would have a problem of under-identification: 12,288 + 1 parameters to estimate (one weight per pixel value, plus the constant whose “data” is 1) with only 209 examples (or “observations”). Too many unknowns.
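To see the under-identification numerically, here is a small sketch (again with made-up dimensions in the same spirit) showing that the cross-product matrix from the least-squares problem is rank-deficient whenever there are fewer examples than parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, m = 20, 5                        # more parameters (n_x + 1) than examples (m)
X = rng.standard_normal((n_x, m))
X_tilde = np.vstack([np.ones((1, m)), X])   # (n_x + 1, m), bias row of 1s on top

# The least-squares normal equations involve this (n_x + 1) x (n_x + 1) matrix,
# whose rank can be at most m, so it is singular and has no unique inverse.
G = X_tilde @ X_tilde.T
print(G.shape, np.linalg.matrix_rank(G))    # (21, 21) with rank only 5
```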
If this makes sense to you, you can see that the “normal equations” of least squares will not have a unique solution. And inverting such huge matrices is, well, computationally difficult and/or slow. But gradient descent works on the logistic regression expression on the left (and on the problem on the right too, although the uniqueness issue remains).
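And here is a bare-bones gradient descent loop for the logistic expression on the left, a sketch with synthetic data and an assumed learning rate, just to show that no matrix inversion is needed even with more parameters than examples:

```python
import numpy as np

rng = np.random.default_rng(2)
n_x, m = 20, 10                       # more parameters than examples, as in the exercise
X = rng.standard_normal((n_x, m))
Y = rng.integers(0, 2, size=(1, m)).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros((n_x, 1)), 0.0, 0.1
for _ in range(200):
    A = sigmoid(w.T @ X + b)              # forward pass
    w -= lr * (X @ (A - Y).T) / m         # gradient step for w
    b -= lr * float(np.sum(A - Y)) / m    # gradient step for b

A = np.clip(sigmoid(w.T @ X + b), 1e-12, 1 - 1e-12)   # clip to keep the logs finite
cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
print(f"cross-entropy cost after 200 steps: {cost:.4f}")
```

If you keep iterating on a separable toy problem like this, the cost keeps shrinking toward zero only as the weights grow without bound (which echoes @SonaA's observation), but every finite iterate is still a perfectly usable classifier.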
This is all to give you some intuition to help with your puzzle. I hope that it helps!