Hello Mentors! There is something in this part of the assignment I don’t understand.
We have to do backpropagation in order to update the parameters. Theoretically, we should start at the last layer (the output layer), in this case layer L, and apply the LINEAR → SIGMOID function. I don’t understand why we start at layer L-1, which is a hidden layer. Normally, for that layer we would implement the LINEAR → RELU function. Am I missing something here?
I actually understood the forward propagation, because there we used a for loop with the range function that excludes the last layer. But in the case of backpropagation, the last layer is the “first” layer.
Hi, Code with Africa.
We do backprop to propagate the total loss back through the NN (neural network) in order to identify how much of the loss each node is responsible for. It also lets us compute the derivatives of the cost w.r.t. the parameters that we used during the forward pass. Here, we are mainly applying the chain rule of calculus.
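Schematically, for a layer l with pre-activation Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]} and activation A^{[l]} = g(Z^{[l]}), the chain rule we apply looks like this (written loosely, ignoring the matrix/Jacobian details, just to illustrate the idea):

$$\frac{\partial \mathcal{L}}{\partial W^{[l]}} \;=\; \frac{\partial \mathcal{L}}{\partial A^{[l]}} \cdot \frac{\partial A^{[l]}}{\partial Z^{[l]}} \cdot \frac{\partial Z^{[l]}}{\partial W^{[l]}}$$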
The instructions in the notebook clearly describe how you build the neural network; a rough code sketch of this loop follows the list below:
Reminder: The general methodology to build a Neural Network is to:
- Define the neural network structure (# of input units, # of hidden units, etc.)
- Initialize the model’s parameters
- Loop:
- Implement forward propagation
- Compute loss
- Implement backward propagation to get the gradients
- Update parameters (gradient descent)
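Putting those steps together, a rough sketch of the training loop looks like the following. It assumes the helper functions you implement in the assignment (initialize_parameters_deep, L_model_forward, compute_cost, L_model_backward, update_parameters); the exact names and signatures may differ slightly in your notebook, so treat this as illustrative rather than the official solution:

```python
def train(X, Y, layers_dims, learning_rate=0.0075, num_iterations=2500):
    parameters = initialize_parameters_deep(layers_dims)        # initialize parameters
    for i in range(num_iterations):                              # training loop
        AL, caches = L_model_forward(X, parameters)              # forward propagation
        cost = compute_cost(AL, Y)                               # compute loss
        grads = L_model_backward(AL, Y, caches)                  # backward propagation
        parameters = update_parameters(parameters, grads, learning_rate)  # gradient descent step
    return parameters
```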
Now, to your query on L: it simply denotes the per-example loss (the output of the loss function), while J is the average of L over the m samples we used in training the model. That is where the factor of 1/m gets its significance in the given formula:
$$J = -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log\left(a^{[2](i)}\right) + \left(1-y^{(i)}\right)\log\left(1-a^{[2](i)}\right)\right)$$
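In code, that average over the m examples is just a sum divided by m. A minimal NumPy sketch (variable names are assumed for illustration, not the official solution):

```python
import numpy as np

def compute_cost(A2, Y):
    # A2: output-layer activations, shape (1, m)
    # Y:  true labels, shape (1, m)
    m = Y.shape[1]
    # Cross-entropy cost, matching the formula above (note the 1/m factor)
    cost = -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m
    return float(np.squeeze(cost))
```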
To your second query on why we start with the hidden layers: in the backward pass we work through the same layers as in the forward pass, retrieving the caches that were stored for each layer as (linear_cache, activation_cache). You can think of backprop as forward prop run in the opposite direction.
Backprop is the essence of neural network training: it fine-tunes the weights of the network based on the error, which is also termed the loss.
Check this thread, which discusses the functioning of backprop in detail. It will give you a very clear idea of how linear_activation_backward is implemented, which in turn calls linear_backward, relu_backward and sigmoid_backward.
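To make that call chain concrete, here is a rough sketch of how linear_activation_backward dispatches to the other helpers (assuming linear_backward, relu_backward and sigmoid_backward as defined in the assignment; illustrative only, not the official solution):

```python
def linear_activation_backward(dA, cache, activation):
    # cache was stored during the forward pass as (linear_cache, activation_cache)
    linear_cache, activation_cache = cache
    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)      # uses Z stored in activation_cache
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
    # linear_backward turns dZ into (dA_prev, dW, db) using A_prev, W, b
    return linear_backward(dZ, linear_cache)
```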
I think the confusion comes from the difference in indexing. Please see the following diagram.
As you know, for backprop we start from layer L, which has a “sigmoid” activation.
Then we start iterating to update the variables for the hidden layers.
The layer numbers run from 1 to L, but Python indexing starts from 0. So caches[0] belongs to the first layer, and caches[L-1] to the L-th layer.
In this sense, the backward pass starts by retrieving the cache values of the last layer from caches[L-1].
Then the iterations begin. The loop covers layers 1 to L-1 in reverse order, so in Python we use reversed(range(L-1)), which accesses caches[L-2] down to caches[0]. And, as explained above, caches[L-2] holds the cached values for layer L-1.
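A minimal sketch of that indexing in code (assuming the assignment's L_model_backward signature and the linear_activation_backward helper discussed above; illustrative only):

```python
import numpy as np

def L_model_backward(AL, Y, caches):
    grads = {}
    L = len(caches)                                   # total number of layers
    # Layer L (output, sigmoid): its cache sits at Python index L-1
    dAL = -(np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    grads["dA" + str(L - 1)], grads["dW" + str(L)], grads["db" + str(L)] = \
        linear_activation_backward(dAL, caches[L - 1], "sigmoid")
    # Hidden layers L-1 down to 1 (relu): caches[L-2] ... caches[0]
    for l in reversed(range(L - 1)):                  # l = L-2, ..., 0 -> layer l+1
        dA_prev, dW, db = linear_activation_backward(
            grads["dA" + str(l + 1)], caches[l], "relu")
        grads["dA" + str(l)] = dA_prev
        grads["dW" + str(l + 1)] = dW
        grads["db" + str(l + 1)] = db
    return grads
```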
So, your understanding is correct. Python indexing confused you.
A very good point, Nobu Asai. Thanks for chiming in and mentioning this angle!
Thank you very much, mentors. It was a little bit confusing and I couldn’t move forward, but now I understand the process. I need to work on another dataset to make sure I’ve grasped most of the concepts. Thank you!!!