Course 2 - Advanced Learning Algorithms: Question about the y-vector in neural networks

Hello-
The y-vector in the neural network seems confusing as explained. If each neuron represents a different “feature” in the vector “a1”, then shouldn’t the y-vector be 2D, specifically of dimension m x k (where k is the number of neurons), i.e. Y = m x k?

It seems unlikely that the same y vector would be used at each neuron, since then the ‘w’ vector and b would end up the same for every neuron after fitting the model.

Please advise/explain. Thank you. M

Hello @Mohamed_Desoky

Every neuron has a different (w, b) that is learned during the training phase. The input dimensions of the neural network are determined by the dimensions of the input vector X, and the number of neurons in the output layer is determined by the dimensions of the output vector y. Going by your example, if the target vector y has a shape of (m, k), then there will be k neurons in the output layer.

What we get to control are the number of hidden layers in the neural network and the number of neurons in each of the hidden layers.

We cannot use y to check the output of each neuron in the hidden layers, nor do we have a target value for each neuron in the hidden layers. The only target value that we have is y, which can be compared against the output of the final layer, \hat{y}. We make sure that the dimensions of \hat{y} match those of y.
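To make the shape relationships concrete, here is a minimal sketch (the layer sizes, loss, and random data are just placeholders, not from the course labs):

```python
import numpy as np
import tensorflow as tf

# Illustrative shapes only: m examples, n input features, k target values per example
m, n, k = 100, 4, 3
X = np.random.rand(m, n).astype("float32")
Y = np.random.rand(m, k).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n,)),                    # input size fixed by X
    tf.keras.layers.Dense(25, activation="relu"),  # hidden layer: our choice
    tf.keras.layers.Dense(15, activation="relu"),  # hidden layer: our choice
    tf.keras.layers.Dense(k),                      # output layer: k units, fixed by Y's shape
])
model.compile(loss="mse", optimizer="adam")
model.fit(X, Y, epochs=1, verbose=0)

print(model.predict(X).shape)  # (m, k): y_hat has the same dimensions as Y
```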

Thank you. However, the “fit” of each neuron in the hidden layers remains unclear. What determines the “fit” of each neuron in the hidden layers? If we send an arbitrary x-vector through, say, the first neuron of the first hidden layer, a transformation occurs that turns that x-vector into a new feature in the output of this first hidden layer. Similarly, sending that same x-vector through the second neuron transforms it into another new feature in the output of the first hidden layer, and so on through the last neuron. In the end, that x-vector will be transformed into “k” new features. And we do this for each of the “m” records, so the output of the first hidden layer is “m x k”. But what determines the optimal “W” matrix and “b” vector for these “k” neurons in the first hidden layer?
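To make the shapes I’m describing concrete, here is a small NumPy sketch of my understanding (the sizes are made up, and random numbers stand in for real data and for whatever W and b the fit would produce):

```python
import numpy as np

m, n, k = 5, 3, 4          # made-up sizes: m examples, n input features, k hidden neurons
X = np.random.rand(m, n)   # input matrix
W = np.random.rand(n, k)   # one column of weights per hidden neuron
b = np.random.rand(k)      # one bias per hidden neuron

A1 = 1 / (1 + np.exp(-(X @ W + b)))  # sigmoid activation of the first hidden layer
print(A1.shape)  # (m, k): each example is turned into k "new features"
```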

More specifically, if we cannot use y to check the output of each neuron in the hidden layers as you say, then what determines fit in the hidden layers?

Please advise if my logic is correct or flawed. -Mohamed

Hello @Mohamed_Desoky

Instead of checking the fit of every neuron in the hidden layer, we use a different concept called backpropagation, wherein the derivative of the cost J w.r.t. the w and b of every neuron in every layer of the neural network is evaluated. This lets us get away from having to find the fit for every neuron in the hidden layer, because we really don’t know what the target value should be for each neuron of the hidden layer. To learn more about backpropagation you can take a look here
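Here is a rough sketch of that idea using TensorFlow’s GradientTape (the tiny model and the random data are only for illustration): the cost J only ever compares \hat{y} with y, yet gradients come out for the parameters of every layer, hidden layers included.

```python
import numpy as np
import tensorflow as tf

X = tf.constant(np.random.rand(10, 3), dtype=tf.float32)  # toy inputs
y = tf.constant(np.random.rand(10, 1), dtype=tf.float32)  # toy targets

model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(4, activation="relu"),  # hidden layer: no target of its own
    tf.keras.layers.Dense(1),                     # output layer: compared against y
])

with tf.GradientTape() as tape:
    y_hat = model(X)
    J = tf.reduce_mean(tf.square(y_hat - y))  # cost uses y and y_hat only

# Backpropagation gives dJ/dw and dJ/db for every layer's parameters,
# even though no "target value" was ever defined for the hidden neurons.
grads = tape.gradient(J, model.trainable_variables)
for var, g in zip(model.trainable_variables, grads):
    print(var.name, g.shape)
```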

Thank you. I’ve gone through the videos and exercises on backpropagation, and while the method does make sense and is largely a review of gradient descent, I still find the concept of hidden layers fuzzy. To be clear, the instructor gives an example where the hidden layer neurons represent ‘affordability’, ‘awareness’, and ‘perceived quality’, which makes it seem like each neuron is evaluating a specific, named feature. But it turns out this is just an example, and what is really happening is that the neural network works as one unit to find the optimal weights at each hidden neuron to create ‘new’ features (stored in the activation matrix) for better predictive results. Do I have that right? So there’s no need to engineer the input x variables for a better-fitting model, because the neural network serves that purpose?

@Mohamed_Desoky

Yes, that is correct.

Yes, and this is one of the strengths of the neural network.

But on the flip side, this also makes it a bit of a black box, because we don’t really have much control over what these new features in the hidden layers really are. The math in backpropagation ensures that relevant features (represented by w, b) are learned in the hidden layers, which eventually contributes to better predictive capability at the output layer.
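That said, you can still peek at what those learned features look like. A minimal sketch, assuming a Keras Sequential model like the ones in the course (the layer names and sizes here are made up): build a second model whose output is a hidden layer’s activations and run your data through it.

```python
import numpy as np
import tensorflow as tf

# Hypothetical model; in practice you would inspect your trained one
model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(4, activation="relu", name="hidden_1"),
    tf.keras.layers.Dense(1, name="output"),
])

# A second model that exposes the activations of the first hidden layer
feature_model = tf.keras.Model(inputs=model.input,
                               outputs=model.get_layer("hidden_1").output)

X = np.random.rand(10, 3).astype("float32")
A1 = feature_model.predict(X, verbose=0)  # the "new features" the network has learned
print(A1.shape)                           # (10, 4)
```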