It seems like a fundamental concept to understand: the shape of W and how it is applied to the matrix X during forward propagation. If X has shape (n, m) and the resulting Y has shape (1, m), it would stand to reason that W would have to have shape (n, 1).

It makes no intuitive sense that such a calculation should be good at recognizing any image other than one fixed in size and position in the frame. It would have to be the relationships between the pixels that allow for the recognition of a cat of any size anywhere in the frame. But in our equation, if I'm understanding the math correctly, the input pixels of each color are simply multiplied by a constant factor. It seems that such a network would have approximately a 50/50 chance (that is to say, no chance) of correctly identifying a cat outside the training sample.

This brings me back to the math, which I must be misunderstanding.

What you are missing is that the w values are not just arbitrary “constants” when viewed in the larger picture: we learn them using back propagation from the training set. In other words, we are given a (hopefully large) set of “worked examples”: this is a picture of a cat and this is not a picture of a cat. Once we’re done with the training, then they are constants, but the point is they aren’t just any old numbers picked out of a hat.

I realize it may seem almost like magic, but the algorithm can actually learn to find patterns in the data even with this seemingly simple structure. Note that we are also “unrolling” or flattening the data, so that in some sense the geometric relationship between the pixels is also lost. But the algorithm still does a pretty good job. Well, in the Logistic Regression case, it’s not that great, but still surprisingly good (72% accuracy even on this small training set). We will see better results in Week 4 when we use a multi-layer network.

Here the shape of w is determined by the fact that the first step is just a linear transformation of the inputs:

Z = w^T \cdot X + b

Then we apply sigmoid to turn that value into something that looks like a probability. So as you say, if X is n x m, then w will be n x 1.
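A quick NumPy sketch makes the shapes concrete. The sizes here (n = 4 features, m = 3 samples) are made up purely for illustration:

```python
import numpy as np

# Hypothetical sizes: n = 4 input features per sample, m = 3 samples.
n, m = 4, 3
X = np.random.randn(n, m)   # inputs: one column per sample, shape (n, m)
w = np.random.randn(n, 1)   # weight vector: one weight per input feature
b = 0.0                     # scalar bias, broadcast across all samples

Z = np.dot(w.T, X) + b      # (1, n) @ (n, m) -> (1, m): one value per sample
A = 1 / (1 + np.exp(-Z))    # sigmoid squashes each value into (0, 1)

assert Z.shape == (1, m)
assert A.shape == (1, m)
```

The transpose on w is exactly what makes the inner dimensions line up: (1, n) times (n, m) gives (1, m), one output per sample.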

When we get to real networks in Week 3, the weights become matrices, not vectors. In that case we no longer need the transpose because of the way Prof Ng chooses to define the W matrices. The analogous linear activation step will be:

Z = W \cdot X + b

Where W for the first layer will be n^{[1]} x n_x, where n_x is the number of input features in each sample (the row dimension of X) and n^{[1]} is the number of output neurons from layer 1. If this doesn’t make sense yet, just stay tuned for the Week 3 lectures. The main point I wanted to make here is that the transpose of w in LR is just an artifact of the notational conventions Prof Ng uses: any “standalone” vector is formatted as a column vector. That’s why we need to transpose w in the first formula above. It’s completely arbitrary, but that is the way Prof Ng does it. He’s the boss, so we just have to follow his lead.
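To see why no transpose is needed with that convention, here is a minimal sketch of the layer-1 linear step, again with made-up sizes (n_x = 4 input features, n1 = 5 neurons, m = 3 samples):

```python
import numpy as np

# Hypothetical sizes: n_x = 4 input features, n1 = 5 layer-1 neurons, m = 3 samples.
n_x, n1, m = 4, 5, 3
X  = np.random.randn(n_x, m)
W1 = np.random.randn(n1, n_x)  # each row holds the weights of one layer-1 neuron
b1 = np.zeros((n1, 1))         # one bias per neuron, broadcast across the m samples

Z1 = np.dot(W1, X) + b1        # (n1, n_x) @ (n_x, m) -> (n1, m), no transpose needed
assert Z1.shape == (n1, m)
```

Because each neuron's weights are already stored as a row of W, the inner dimensions match without transposing, which is exactly the notational convenience described above.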

Okay, I can believe that a certain variety of colors are associated with pictures of cats versus pictures that don’t contain cats. I believe I understand W now. Thanks!

The point is that it’s not just about recognizing colors. It’s about the patterns: the relationships between the pixels that define shapes and then features from shapes (e.g. two edges that meet in a point may be a cat’s ear). Cats come in lots of colors, right? But what is different about them from say dogs or trucks or pianos is that their ears (and whiskers and tails) have pretty distinctive shapes.

Recognizing a pattern would require two pixels being mathematically related, which they aren’t so far in this model. Add another layer, and sure, there will be all kinds of relationships. But so far, to conceptually explain why the results are 72/28 versus 50/50, you’d have to look at qualities this network would be able to identify, like color.

I don’t understand what you mean by the pixels not being mathematically related in this model. The pixels are what they are. And, as you say, they are color values. We are learning the coefficients of an affine transformation of the form Z = w^T \cdot X + b which results in a number (affected by all the input pixels and all the coefficients) which is then fed into sigmoid to create a probability value for each image. Because of our training labels, those probabilities are interpreted as the probability that the image is a picture of a cat.

When we get to actual Neural Networks later in this course and have multiple layers, the output layer still looks exactly like Logistic Regression, but now the inputs can be thought of as distilling information about the image instead of just the raw (well, normalized) pixel color values.
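The point that the output layer still looks like Logistic Regression can be sketched in a few lines. This is a toy two-layer forward pass with made-up sizes, not the course's actual network:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical sizes: 4 raw input features, 5 hidden units, 3 samples.
n_x, n1, m = 4, 5, 3
X  = np.random.randn(n_x, m)
W1, b1 = np.random.randn(n1, n_x), np.zeros((n1, 1))
W2, b2 = np.random.randn(1, n1),   np.zeros((1, 1))

A1 = np.tanh(np.dot(W1, X) + b1)    # hidden layer: distilled features, not raw pixels
A2 = sigmoid(np.dot(W2, A1) + b2)   # output layer: same form as Logistic Regression
assert A2.shape == (1, m)           # one probability per sample
```

The last two lines are literally the LR computation from earlier; the only difference is that the inputs A1 are learned features rather than raw pixel values.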

I grant you that all this seems pretty much like magic, but you can demonstrate that it works.