I am trying to understand how a neural network for the cat example would look. In the videos, Professor Ng demonstrates a neural network with an input layer that contains 3 features (x1, x2, x3). I was trying to think about how the input layer would look with our cat example, which has 209 examples, each example being a feature vector with 12,288 values. (Is "value" the correct way to refer to each pixel?) In our week 2 cat example, after forward propagation, we ended up with a vector of 209 predictions, which we then back-propagated to train the "w" values and so forth.

My question is: if we re-did the week 2 assignment with a neural network, would we essentially make one neural network for each picture, and likewise end up with a vector of 209 predictions, or does the neural network operate on all 209 (12288x1) vectors simultaneously? Sorry if this question doesn't make sense; I'm just trying to synthesize my understanding into a few paragraphs to ask it.

The week 2 assignment simply used logistic regression, without any hidden layer. But in week 3, we do apply one hidden layer. The task gets complicated here because we don't have many cat pictures available on the net. In that case, we need to synthesise more through the process of data augmentation.
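To make the augmentation idea concrete, here is a minimal sketch of one common trick: doubling a small image dataset by adding horizontally mirrored copies. The array shapes (64 x 64 x 3 images, samples stacked along the first axis) follow the course's cat example; the random "images" are just placeholders.

```python
import numpy as np

# Toy "dataset": 4 images of shape 64 x 64 x 3 with random pixel values,
# standing in for real cat photos.
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(4, 64, 64, 3), dtype=np.uint8)

# Horizontal flip: mirror each image along its width axis (axis=2).
# A mirrored cat is still a cat, so the label is unchanged.
X_flipped = X[:, :, ::-1, :]

# Stack originals and flips to double the dataset size.
X_augmented = np.concatenate([X, X_flipped], axis=0)
print(X_augmented.shape)  # (8, 64, 64, 3)
```

Real pipelines add random crops, rotations, and color jitter as well, but the principle is the same: cheap label-preserving transformations stretch a small dataset further.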

The rest of the explanation given in those links has been well justified by Paul sir. Let me know if you get an idea, and we can always discuss it over here.

As Rashmi says, we do exactly the same cat recognition task in Week 4 using both 2 layer and 4 layer neural networks, so “stay tuned” and you’ll see all this in action.

But when we make predictions, we can use vectorization to apply the network to all the samples (209, or 50, or however many you have) at once. It applies the same coefficients at each layer to all the inputs in parallel. But it's important to realize that it's not "one network per picture": it's one network that can work on all the pictures independently and in parallel through vectorization. Just think about how matrix multiplication works when you do this:

Z^{[1]} = W^{[1]} \cdot X + b^{[1]}

Of course the key point is that each “sample” is a column of X. Each row of W^{[1]} is the coefficients for one output neuron of layer 1. Now draw a picture of the two matrices and “play out” how matrix multiplication works: each row of the first operand gets marched across the columns of the second operand to perform the “linear combination” at each position, right?
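A quick NumPy sketch of that "linear combination" view, using the dimensions from the week 3 video (3 input features, 4 hidden neurons) with 209 samples as columns of X. The random values are placeholders; only the shapes matter here.

```python
import numpy as np

rng = np.random.default_rng(1)

n_x, n_h, m = 3, 4, 209                  # features, hidden neurons, samples

X = rng.standard_normal((n_x, m))        # each COLUMN of X is one sample
W1 = rng.standard_normal((n_h, n_x))     # each ROW of W1 is one neuron's weights
b1 = np.zeros((n_h, 1))                  # bias, broadcast across all 209 columns

Z1 = W1 @ X + b1                         # (4, 3) @ (3, 209) -> (4, 209)
A1 = np.tanh(Z1)                         # activations for all 209 samples at once

print(Z1.shape, A1.shape)                # (4, 209) (4, 209)
```

One matrix multiply applies every neuron's row of weights to every sample's column of inputs: 209 forward passes through the layer in a single operation, with no per-picture loop.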

Thank you for the clarification!! I guess I'll just have to hurry on through to make it to week 4 then;) If I may ask one more follow-up question though: it pertains to the following screenshot taken from the week 3 video "Computing a Neural Network's Output." I can see how, as you mentioned, for layer [1] each sample is a column of x, representing all of the input elements, and it is multiplied by a matrix of weights, in this case of size (4,3): 4 rows to represent 4 nodes, and 3 columns to represent each sample. In this way we can simultaneously compute an entire "hidden layer."

What has me confused is how the input layer would look if it were the cat example. If we say that each sample (e.g. a1[0], a2[0], etc.) is an entire image, not each individual RGB pixel value, then that would mean the calculation for Z[1] would be Z[1] = W[1].X + b, where W[1] would be a matrix of size (4,209) (assuming the same number of nodes on the first layer) and X would be a column vector of size (209,1). The issue I have with that is that it seems like we are missing a way to individually weight each of the 12288 pixels like before, because we just have one weight applied to each image.

Or is the setup actually one that takes in an X vector of 12288 inputs, in the form of RGB pixel values, so that the shape of W[1] is actually (4,12288) and the size of the X vector is (12288,1)? That for some reason makes more sense to me, because it gels with your clarification that it's not "one network per picture" but rather one network that can work on all the pictures.

I guess the heart of my question is: what are the shapes of the W matrix and the X vector that go into computing the first layer, if we were doing the cat example? Sometimes I get lost on what they mean by "example" (by example I refer to the subscript of a1[0], a2[0], a3[0]). Do they mean each individual element of a feature vector (in the case of week 2, each pixel), or do they mean each vector (again, in the case of week 2, each image)? And if they mean each image, how do you create the network that can weigh the relationship of each pixel?

I'm sorry if this question is diving deep into a trivial matter; it'll be my last question about it before I do more research on my own I promise!!

Thank you Rashmi for those recommendations. I posted a follow-up question; if you could weigh in on it I would be soooo appreciative. Thank you for all your help.

If we are taking our inputs as 64 x 64 x 3 RGB images, then each “sample” has 12288 values (pixel color values) arranged as a column vector. So each column of X has 12288 entries (rows).

Let’s suppose that the first hidden layer has 25 output neurons. So what is the dimension of W^{[1]} in that case? It will be 25 x 12288, right? 25 rows (one for each neuron) and 12288 columns, because we need a weight (coefficient) corresponding to each input pixel value.

I really think that instead of spending your time composing complicated replies, you should do what I suggested in my previous reply: actually draw a picture of the two matrices, W^{[1]}, which is 25 x 12288 in our example, and the X matrix, which is 12288 x 209. Then "play out" how the matrix multiply works for W^{[1]} \cdot X.
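You can also "play out" the multiplication in code. The sketch below uses the exact dimensions from this thread (25 hidden neurons, 12288 pixel values per image, 209 samples) and checks that the vectorized product W^{[1]} \cdot X agrees with processing one image column at a time; the random values are stand-ins for real weights and pixels.

```python
import numpy as np

rng = np.random.default_rng(2)
n_x, n1, m = 12288, 25, 209              # pixels per image, neurons, samples

W1 = rng.standard_normal((n1, n_x))      # 25 x 12288: one weight per neuron per pixel
X = rng.standard_normal((n_x, m))        # 12288 x 209: one image per column
b1 = rng.standard_normal((n1, 1))

# Vectorized: all 209 images at once.
Z_vec = W1 @ X + b1                      # (25, 12288) @ (12288, 209) -> (25, 209)

# "Played out": one image (one column of X) at a time.
Z_loop = np.zeros((n1, m))
for i in range(m):
    Z_loop[:, i] = W1 @ X[:, i] + b1[:, 0]

print(Z_vec.shape, np.allclose(Z_vec, Z_loop))
```

Column i of the result is exactly what you'd get by feeding image i through the layer on its own, which is the whole point: one network, all pictures in parallel.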