Hi everyone, I have the following question. Some of the derivatives, for example dz = a(1-a), were calculated when the chosen activation function was the sigmoid. Now we are using tanh, and later we might use ReLU. How can we get the correct equations? I hope I'm making myself clear.
Thanks in advance
You would have to understand enough calculus to compute the partial derivative of the cost equation using the tanh() activation.
The derivative of the sigmoid is not the same as the derivative of the tanh, so the equation should at least be different.
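For reference, these are the standard results, writing $a = g(z)$ in each case:

$$\sigma'(z) = a(1 - a) \qquad\text{and}\qquad \tanh'(z) = 1 - a^2,$$

so the two expressions do indeed differ.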
Yes, you need to calculate the derivative of each of the activation functions. That is covered in the various assignments where we use something other than sigmoid. You'll see tanh in the Planar Data assignment in Week 3 and then you'll see ReLU in the "Step by Step" assignment in Week 4. The relevant general equation that involves the activation function at each layer is:
dZ^{[l]} = dA^{[l]} * g^{[l]'}(Z^{[l]})
Of course the derivative g'() depends on what the function g() is.
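To make that concrete, here is a minimal numpy sketch (my own illustration, not the assignment code, which, if I recall correctly, packages these as helpers like `sigmoid_backward` and `relu_backward`) of computing $g'(Z)$ for each activation and applying $dZ = dA * g'(Z)$:

```python
import numpy as np

# Derivatives of the common activation functions, evaluated at Z.
# Illustrative sketch only; not the course's actual helper functions.

def sigmoid_prime(Z):
    A = 1 / (1 + np.exp(-Z))
    return A * (1 - A)            # g'(z) = a(1 - a)

def tanh_prime(Z):
    A = np.tanh(Z)
    return 1 - A ** 2             # g'(z) = 1 - a^2

def relu_prime(Z):
    return (Z > 0).astype(float)  # g'(z) = 1 if z > 0 else 0

def dZ_from_dA(dA, Z, activation):
    """Apply dZ = dA * g'(Z) for the chosen activation (hypothetical helper)."""
    primes = {"sigmoid": sigmoid_prime, "tanh": tanh_prime, "relu": relu_prime}
    return dA * primes[activation](Z)
```

The only thing that changes when you switch activation functions is which $g'$ gets plugged into that one formula.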
But notice that in all cases here in DLS Course 1, we are doing binary classifications. That means that the activation at the output layer is always sigmoid, and that's the derivative that interacts with the loss function. Here's a thread showing how all the derivatives play out at the output layer in a binary classification.
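In case the linked thread isn't handy, here is a sketch of the standard result at the output layer. With the cross entropy loss $\mathcal{L}(a, y) = -\big(y \log a + (1 - y)\log(1 - a)\big)$ and $a = \sigma(z)$:

$$\frac{\partial \mathcal{L}}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}, \qquad \frac{\partial a}{\partial z} = a(1 - a),$$

so by the chain rule

$$dZ^{[L]} = \frac{\partial \mathcal{L}}{\partial a} \cdot \frac{\partial a}{\partial z} = a - y,$$

which is why the output layer computation ends up being simply `AL - Y`, regardless of which activations the hidden layers use.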
Thanks for the answer, but backpropagation derivatives go all the way back to the beginning of the NN, so when we use tanh as an activation function the formula should change, right? Also, I saw in the optional reading that the formula of the derivative when using softmax is different as well. How do we build a NN that takes those changes into account?
Have you gotten to week 4 of DLS Course 1 yet? There Prof Ng shows us the back propagation formulas at each layer and how the activation functions affect both forward and backward propagation. I gave the key formula earlier in this thread which shows the point at which the derivative of the activation function affects the results.
The overall point is that back propagation starts from the output layer, and we process each layer one at a time, stepping backwards through the layers. For every layer other than the output (last) layer, the input to the computation for the current layer is the output of the back prop calculation from the next (later) layer. So we apply the formula I showed above at each layer and then it literally "propagates" backwards to the previous layers. So if we use tanh at layer 3 of a 4 layer network, then the derivative of tanh affects the results at layer 3, which then affects the results at layer 2 and layer 1. That's what they mean by "backward propagation".
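Here is a hedged sketch of that backward loop in numpy (my own illustration with invented names like `linear_backward` and `backward_pass`, not the assignment's actual functions), assuming the forward pass cached `(A_prev, W, Z)` for each layer:

```python
import numpy as np

def linear_backward(dZ, A_prev, W):
    """Gradients of the linear step Z = W @ A_prev + b."""
    m = A_prev.shape[1]
    dW = dZ @ A_prev.T / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

def backward_pass(AL, Y, caches, hidden_activation="tanh"):
    """Sketch of back prop for a binary classifier.

    caches[l] holds (A_prev, W, Z) saved during forward prop for layer l+1.
    The output layer uses sigmoid + cross entropy, so dZ at layer L is AL - Y.
    """
    grads = {}
    L = len(caches)

    # Output layer: sigmoid + cross entropy collapses to AL - Y.
    A_prev, W, Z = caches[L - 1]
    dZ = AL - Y
    dA_prev, grads[f"dW{L}"], grads[f"db{L}"] = linear_backward(dZ, A_prev, W)

    # Hidden layers: dZ = dA * g'(Z), stepping backwards one layer at a time.
    for l in reversed(range(L - 1)):
        A_prev, W, Z = caches[l]
        if hidden_activation == "tanh":
            dZ = dA_prev * (1 - np.tanh(Z) ** 2)
        else:  # relu
            dZ = dA_prev * (Z > 0)
        dA_prev, grads[f"dW{l+1}"], grads[f"db{l+1}"] = linear_backward(dZ, A_prev, W)

    return grads
```

Notice that the only place the choice of hidden activation shows up is the single line that computes $g'(Z)$; everything else is the same regardless of which activation you pick.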
This is all covered in the lectures and in the assignments. If you have not yet gotten through DLS Course 1, I suggest you “hold that thought” and proceed with the course and listen to what Prof Ng explains and then you’ll get to implement it in the assignments. It should all be clear after that. If not, we can discuss more.