Why does “logistic regression” work at all for images?

Intuitively, it seems like logistic regression (the way we implemented it at least) treats each pixel/color component independently because the cost / derivative is a pointwise comparison from training pixel component to parameter weight.

But of course they are highly correlated - you can imagine taking a photo of a cat on a larger background and rotating/cropping it differently so as to generate a ton of images each with a slightly different pixel offset. Or just slight tints to an image. I would guess you want each parameter to be able to look at the other components of the training data to actually be able to extract patterns/relationships. As it stands, I have to imagine if you trained our toy logistic “network” with many variations of shifted/tinted cats it would basically be noise.

Or am I missing something?

These are really interesting points. I totally agree that this all seems pretty much like magic, but we can demonstrate that it actually works. You could also argue that “flattening” the 2D images plus 3 colors per pixel into vectors loses all the geometric relationships between the pixels, but it turns out the algorithm can still learn to detect the patterns: the flattening is done the same way for every image, so each weight always lines up with the same pixel position and color. Of course Logistic Regression is just the start of our ML journey here. Prof Ng presents it first as a “trivial” Neural Network, because the output layer of a feed forward neural network doing a binary classification looks exactly like Logistic Regression. But the reason that a real NN is more powerful is that the input to the LR layer is not just raw pixel color values, but “distilled” information that has been processed by the earlier layers of the network.
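To make the “flattening” point concrete, here is a minimal NumPy sketch. The 64 x 64 x 3 image shape is an assumption chosen because it matches the 12,288 features mentioned in this thread, the random `images` array is just a stand-in for real data, and `flat_index` is a hypothetical helper, not something from the course notebook.

```python
import numpy as np

# Assumed shape: 5 images of 64 x 64 pixels with 3 color channels
# (64 * 64 * 3 = 12288). Random values stand in for real photos.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 64, 64, 3), dtype=np.uint8)

# Flatten each image into one column and scale to [0, 1].
X = images.reshape(images.shape[0], -1).T / 255.0  # shape (12288, 5)


def flat_index(row, col, channel, width=64, channels=3):
    """Hypothetical helper: position of one pixel component in the flat vector."""
    return (row * width + col) * channels + channel


# The same (row, col, channel) lands on the same feature index in every image,
# so a given weight is always multiplied by the same pixel component.
i = flat_index(10, 20, 2)
same_component = X[i, 3] == images[3, 10, 20, 2] / 255.0
```

The geometry is not destroyed so much as hidden: neighbouring pixels end up at predictable but distant indices, and nothing in the model knows they are neighbours.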

To your point about the coefficients needing to detect the same patterns at many different points in the image, we’ll later learn even more powerful techniques like Convolutional Networks that have the ability to pass a “filter” over all parts of the image. That will be covered in Course 4 of the DLS series.

But back to your high level points, you could actually run some experiments where you take our dataset here and perform the types of “data augmentation” that you describe: slight rotations or offsets or distortions or color “bending” and see how that affects the results. In general, what we will learn later is that “data augmentation” like that usually helps, because it gives your model more valid examples to learn from. (That will also be covered in Courses 3 and 4.) Whether it would help using Logistic Regression and our tiny dataset here, I’m not sure. Maybe you’re right that the result would just be more noise.
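If you want to try that experiment, a rough sketch of the shift and tint augmentations might look like the following. The function names, the shift range and the tint range are all assumptions for illustration, and images are assumed to be float arrays of shape (H, W, 3) scaled to [0, 1].

```python
import numpy as np


def shift_image(img, dy, dx):
    """Shift an (H, W, 3) image by (dy, dx) pixels; exposed edges become zeros."""
    h, w, _ = img.shape
    out = np.zeros_like(img)
    # Part of the source image that is still visible after the shift.
    src = img[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    top, left = max(0, dy), max(0, dx)
    out[top:top + src.shape[0], left:left + src.shape[1]] = src
    return out


def tint_image(img, scale):
    """Multiply each color channel by its own factor and clip back to [0, 1]."""
    return np.clip(img * np.asarray(scale), 0.0, 1.0)


def augment(img, rng, max_shift=2, max_tint=0.1):
    """Return one randomly shifted and tinted copy of img (assumed ranges)."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    scale = 1.0 + rng.uniform(-max_tint, max_tint, size=3)
    return tint_image(shift_image(img, int(dy), int(dx)), scale)
```

Calling `augment(img, np.random.default_rng(0))` several times per training image and appending the results (with the same labels) to the training set would give you the shifted/tinted dataset to compare against the original.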

On the general topic of why Logistic Regression doesn’t always work that well, what it is doing is learning a hyperplane in the input space that does the best job of linearly separating the yes and no answers. Of course the issue is that there is no guarantee that any given dataset is linearly separable. The other difficulty here is that we are talking about a hyperplane in 12,288-dimensional space (at least in our specific example here), which our meager human brains don’t have much hope of being able to visualize.
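As a sketch of the hyperplane picture, assuming the usual setup with a weight column vector `w` of shape (n, 1), a scalar bias `b`, and one flattened image per column of `X` (the function names here are my own, not necessarily the notebook's):

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predict(w, b, X):
    """Return 0/1 labels of shape (1, m) for X of shape (n, m)."""
    z = w.T @ X + b  # signed score: which side of the hyperplane w.x + b = 0
    return (sigmoid(z) > 0.5).astype(int)  # same decision as z > 0


def gradients(w, b, X, Y):
    """Gradients of the averaged cross-entropy cost with respect to w and b."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)  # each activation mixes all n pixel values
    dw = X @ (A - Y).T / m    # shape (n, 1)
    db = np.sum(A - Y) / m
    return dw, db
```

Note that the update for weight j is pixel component j times the error term `A - Y`, and that error term depends on every pixel through `w.T @ X`. So the weights are coupled, but only through a single weighted sum per image, which is why the model can do no better than one linear boundary.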

Of course none of the above really counts as an answer to any of your questions, but they are not straightforward ones. Thanks for starting the discussion!


I can see how there is maybe some correlation through the averaging cost/loss function but it feels pretty weak.

Just completed week 2 (so haven’t seen CNNs yet), but imagine feeding not just the current pixel but also its 8 surrounding pixels into the first layer of the network, then feeding each neighboring 9-pixel (3×3) block into the next layer, and so on until you have 9 mega blocks left at the final layer. I would imagine it could then capture both small-scale and large-scale patterns. Or you could try a 16-pixel (4×4) block size, so the number of layers is log4 of the image width. Intuition tells me that it would be more expensive to compute / converge if you made every pixel depend on every other one (n^2), so instead I am imagining picking a smaller kernel size and doing this hierarchically across layers. Rampant speculation on my part - I’m sure I will learn more in a future week.

Cool! You are describing the basic idea behind Convolutional Nets, so “hold that thought” and Prof Ng will get to that whole topic in depth in Course 4.
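In the meantime, here is a minimal sketch of the “small kernel applied everywhere” idea, assuming a single-channel (grayscale) image and a hand-picked 3x3 kernel. A real convolutional layer learns its kernel values and handles multiple channels; `filter_3x3` is a hypothetical name for illustration only.

```python
import numpy as np


def filter_3x3(img, kernel):
    """Slide one 3x3 kernel over a 2D image, using only fully covered positions."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            # The same 9 weights are reused at every location.
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * kernel)
    return out


# Hand-picked kernel that responds to dark-to-bright vertical edges.
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

# Two toy 8x8 images: the same edge, one pixel further right in the second.
img_a = np.zeros((8, 8))
img_a[:, 4:] = 1.0
img_b = np.zeros((8, 8))
img_b[:, 5:] = 1.0

resp_a = filter_3x3(img_a, kernel)
resp_b = filter_3x3(img_b, kernel)
```

Because the weights are shared across positions, shifting the edge by one pixel just shifts the response by one pixel. That is exactly the property plain Logistic Regression lacks, since there each pixel position has its own separate weight.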
