In the first week of the course, if there are multiple layers, what are the y values for those layers? I believe it is trying to use gradient descent to match the y values. I am also not able to understand how to choose the number of layers.

There are no ‘y’ values for the hidden layers. ‘y’ refers to the output of the NN.

We do not know what the correct hidden layer values are, because they do not exist in observations that form the training set.

Experimentation is used to choose the number of layers, and the number of units in each layer.

The goal is a solution that is simple enough that training isn’t too difficult, while still giving good enough results.

There is no formula for this.

Gradient descent is being used to match the y values (the target values available from the dataset). Typically, the error between the actual and target output values is computed, and this error is then fed back layer by layer so that gradient descent can compute the weight changes needed to reduce it. I guess things will become clearer when you study the backpropagation algorithm (a popular name for a learning rule known as the generalized delta rule).

The weight update equation (for any weight in any layer) is given by: w = w - \alpha \cdot \frac{\partial J}{\partial w}

So, we don’t really need to know y for each layer to be able to find \frac{\partial {J}} {\partial {w}}

If we know J at the output layer, then we can find \frac{\partial {J}} {\partial {w^{[l]}}} for any layer l, by applying the chain rule.

We find \frac{\partial {J}} {\partial {w^{[L]}}} at the output layer (call it layer L), then apply the chain rule to go one layer back and find \frac{\partial {J}} {\partial {w^{[L-1]}}} at layer L-1, and so on. Notice that the J in all these derivatives is the very same J that is defined at the output layer.
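To make the chain-rule idea concrete, here is a minimal NumPy sketch, not the course's implementation. It assumes a tiny network with one ReLU hidden layer, a linear output, and a squared-error cost; the function names, shapes, and learning rate are all made up for illustration. Note that y appears only where the output-layer error is computed; the hidden layer's gradient is derived from that same J.

```python
import numpy as np

def forward(params, X):
    W1, b1, W2, b2 = params
    Z1 = X @ W1 + b1           # hidden pre-activation
    A1 = np.maximum(Z1, 0.0)   # hidden activation (ReLU); there is no target for this
    Z2 = A1 @ W2 + b2          # linear output layer, this is y_hat
    return Z1, A1, Z2

def cost(params, X, y):
    _, _, y_hat = forward(params, X)
    return 0.5 * np.mean((y_hat - y) ** 2)   # J, defined only at the output

def gradients(params, X, y):
    W1, b1, W2, b2 = params
    m = X.shape[0]
    Z1, A1, y_hat = forward(params, X)
    dZ2 = (y_hat - y) / m      # dJ/dZ2: the only place y is used
    dW2 = A1.T @ dZ2           # dJ/dW2 (output layer)
    db2 = dZ2.sum(axis=0)
    dA1 = dZ2 @ W2.T           # chain rule: push the same J one layer back
    dZ1 = dA1 * (Z1 > 0)       # multiply by the ReLU derivative
    dW1 = X.T @ dZ1            # dJ/dW1 (hidden layer)
    db1 = dZ1.sum(axis=0)
    return [dW1, db1, dW2, db2]

def gradient_step(params, grads, alpha):
    # w = w - alpha * dJ/dw, applied to every parameter in every layer
    return [p - alpha * g for p, g in zip(params, grads)]

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))    # 8 examples, 3 features (arbitrary sizes)
y = rng.normal(size=(8, 1))
params = [rng.normal(size=(3, 4)), np.zeros(4),
          rng.normal(size=(4, 1)), np.zeros(1)]

cost_before = cost(params, X, y)
params = gradient_step(params, gradients(params, X, y), alpha=0.01)
cost_after = cost(params, X, y)
print(cost_before, cost_after)
```

A single output unit is assumed so that the mean in the cost runs over the examples only.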

Thanks @shanup for your answer.

As an exercise, I was trying to replicate week 2’s multiclass lab from scratch with numpy.

I spent some time thinking about the implementation: your answer provided some guidance…

However, if the 1st layer has a ReLU activation function (which is non-differentiable), what should the approach be?

I thought about switching from ReLU to a linear activation, but as seen in the lectures, there is no point in having 2 layers with a linear activation when they could be replaced with just one.

Any suggestions?

Thanks in advance!

Kind regards

Hello Fernando @altromondo,

For ReLU you may implement its derivative in 2 steps: when x > 0, the derivative of ReLU is 1, otherwise it is 0. I think numpy has some step functions that you can use for that.
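As a rough sketch of that idea in NumPy (the helper names `relu` and `relu_grad` are just illustrative, not from the lab), a boolean comparison already acts as the step function. `np.heaviside(z, 0.0)` should give the same result.

```python
import numpy as np

def relu(z):
    # max(z, 0), applied elementwise
    return np.maximum(z, 0.0)

def relu_grad(z):
    # 1 where z > 0, and 0 everywhere else (including exactly z == 0)
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))       # [0. 0. 3.]
print(relu_grad(z))  # [0. 0. 1.]
```

In backprop you would multiply the gradient arriving from the next layer elementwise by `relu_grad(z)`.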

Raymond

Hello @altromondo

If you intend to do this manually, we can do it in 2 steps as already highlighted by Raymond.

There is still the issue of the derivative at 0 being undefined. We could set it to either 0 or 1. Here is a paper that looks at this issue by treating that value as a hyperparameter and checking its impact on model accuracy.

In their experiment, setting the derivative at 0 to 0 gave better results with SGD. But things started to even out when they used batch norm or Adam.
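If you want to try both choices yourself, one way to expose the value at 0 as a knob is `np.heaviside`, whose second argument is exactly the value returned where the input is 0. This is only a sketch; the function name `relu_grad_at_zero` and its parameter are made up here and are not taken from the paper.

```python
import numpy as np

def relu_grad_at_zero(z, grad_at_zero=0.0):
    # np.heaviside gives 0 for z < 0, 1 for z > 0,
    # and its second argument wherever z == 0.
    return np.heaviside(z, grad_at_zero)

z = np.array([-1.0, 0.0, 2.0])
print(relu_grad_at_zero(z, 0.0))  # [0. 0. 1.]
print(relu_grad_at_zero(z, 1.0))  # [0. 1. 1.]
```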

Hello,

As @TMosh said correctly, y refers to the output layer. However, your question can still be interpreted as “how do you interpret the values in the hidden layers”. The way most people think of this is as follows. The values in the input layer are the datapoints in your dataset. A neural network applies a matrix followed by an activation function at each layer. So the values in the first hidden layer are a transformed version of the input layer.

For example, maybe we are trying to distinguish points labeled by 0 from points labeled by 1, and if you look at the dataset, these points cannot be separated by a plane (like how we classify points in logistic regression). The idea is perhaps that by applying some transformation to the data points, it will allow for the transformed points to be separated by a plane. So the values in the 1st hidden layer are a transformed version of the points in the input layer, or in other words, they are a new set of “learned” features which hopefully are easier to interpret. If you have a 2nd hidden layer, then another transformation is applied and we get yet another learned set of features which is hopefully even better than the first (though often one hidden layer is enough).
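Here is a small illustration of that with NumPy, using the classic XOR points, which no single line can separate in the original coordinates. The hidden-layer weights below are picked by hand purely for illustration (a trained network would have to learn something like them), and all the variable names are my own.

```python
import numpy as np

# XOR: labels cannot be separated by a line in the original (x1, x2) space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0])

# Hand-picked hidden layer (an assumption, not learned): a matrix plus a bias,
# followed by ReLU
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
H = np.maximum(X @ W1 + b1, 0.0)   # the transformed "features"
print(H)
# [[0. 0.]
#  [1. 0.]
#  [1. 0.]
#  [2. 1.]]

# In the transformed space a single line separates the classes:
# h1 - 2*h2 > 0.5
w = np.array([1.0, -2.0])
predictions = (H @ w > 0.5).astype(int)
print(predictions)  # [0 1 1 0]
```

The two points labeled 1 land on the same spot in the hidden layer, which is what makes the linear split possible.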

I hope that helps!

Alex