How to get the derivatives of the logistic cost/loss function [TEACHING STAFF]

Hi! If you’re wondering how to get the derivatives for the logistic cost/loss function shown in Course 1, Week 3, “Gradient descent implementation”:

I made a Google Colab (includes videos and code) that explains how to get these equations.
(hold Ctrl + click on Windows, or Command + click on Mac, to open in a new tab)
ML Specialization Course 1 - Derivatives of the Logistic Loss Function


You can also just watch the lecture videos on YouTube:
Derivatives of the Logistic Cost Function


Here is a PDF file with the slides:
Derivative of logistic loss slides.pdf (1.8 MB)


[Animated preview of the “Derivatives of the Logistic Loss Function” Colab]


Here are some of the slides from the videos (you’ll hear more commentary in the videos).
You can get the derivative of the loss with respect to parameter “w” by calculating three separate derivatives and multiplying them together. This is the “chain rule” in calculus and it’s a useful concept that shows up elsewhere (like in neural networks, which you’ll learn about in course 2).
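(For readers who can’t view the slide images, here is that decomposition written out, using the same notation as the lectures: z = wx + b, f = \frac{1}{1 + e^{-z}} is the sigmoid output, and L = -[y \ln(f) + (1 - y) \ln(1 - f)] is the loss.)

\displaystyle \frac{\partial{L}}{\partial{w}} = \frac{\partial{L}}{\partial{f}} \cdot \frac{\partial{f}}{\partial{z}} \cdot \frac{\partial{z}}{\partial{w}}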





Here’s how to get the first derivative:
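(Written out, assuming the slides take the factors in the order \frac{\partial{L}}{\partial{f}} \cdot \frac{\partial{f}}{\partial{z}} \cdot \frac{\partial{z}}{\partial{w}} shown above, this first factor is the derivative of the loss with respect to the prediction f:)

\displaystyle \frac{\partial{L}}{\partial{f}} = \frac{\partial}{\partial{f}} \left( -[y \ln(f) + (1 - y) \ln(1 - f)] \right) = -\frac{y}{f} + \frac{1 - y}{1 - f}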



Here’s how to get the second derivative:
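(This second factor is the derivative of the sigmoid with respect to z:)

\displaystyle \frac{\partial{f}}{\partial{z}} = \frac{\partial}{\partial{z}} \left( \frac{1}{1 + e^{-z}} \right) = \frac{e^{-z}}{(1 + e^{-z})^2} = f(1 - f)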



Here’s how to get the third derivative:
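(And this third factor is the derivative of z = wx + b with respect to w:)

\displaystyle \frac{\partial{z}}{\partial{w}} = \frac{\partial}{\partial{w}} (wx + b) = x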



And if you multiply the three derivatives together, you’ll end up with the expression that you saw in the lectures:
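(In case the slide image doesn’t load: multiplying the three factors and simplifying,)

\eqalign{ \frac{\partial{L}}{\partial{w}} &= \left(-\frac{y}{f} + \frac{1 - y}{1 - f}\right) \cdot f(1 - f) \cdot x \\ &= \left(-y(1 - f) + (1 - y)f\right) \cdot x \\ &= (f - y) \cdot x }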




To get the derivative of the cost with respect to parameter “b”, you only need to compute one new derivative, \frac{\partial{z}}{\partial{b}}; the other two derivatives (\frac{\partial{L}}{\partial{f}} and \frac{\partial{f}}{\partial{z}}) are the same ones you already calculated.
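In symbols, since \frac{\partial{z}}{\partial{b}} = \frac{\partial}{\partial{b}}(wx + b) = 1:

\displaystyle \frac{\partial{L}}{\partial{b}} = \frac{\partial{L}}{\partial{f}} \cdot \frac{\partial{f}}{\partial{z}} \cdot \frac{\partial{z}}{\partial{b}} = (f - y) \cdot 1 = f - y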







One thing worth noting is that being able to re-use some of the derivative calculations elsewhere is very helpful. You’ll see more of this when you learn about neural networks in the second course.

Let me know what you think!
-Eddy

18 Likes

Great explanation for anyone trying to understand these derivatives. Thanks!

1 Like

Thank you for this! Was exactly what I was looking for :slight_smile:

1 Like

Dear Eddy, well done and very neat explanation!

I could also run your code from “ML_Specialization_Course_1_Derivatives_of_the_Logistic_Loss_Function.ipynb” in Jupyter Notebook 6.5.2 with Python 3.7.7, after downloading it from the copy in my Google Drive.

The comparison of Andrew’s equation (e.g. dL_dw=(f-y)*xj) and your chain rule equation (e.g. dL_dw=dL_df * df_dz * dz_dw) is very good too!
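For anyone who wants to reproduce that comparison without opening the Colab, here is a minimal sketch of such a check (the sample values are made up, and the actual notebook code may differ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# made-up sample values, just for the comparison
w, b, x, y = 0.7, -0.4, 2.5, 1.0

z = w * x + b      # linear part
f = sigmoid(z)     # model prediction

# Andrew's expression from the lecture
dL_dw_lecture = (f - y) * x

# the three chain-rule factors, multiplied together
dL_df = -y / f + (1 - y) / (1 - f)
df_dz = f * (1 - f)
dz_dw = x
dL_dw_chain = dL_df * df_dz * dz_dw

print(np.isclose(dL_dw_lecture, dL_dw_chain))  # prints True
```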

Thanks again!

2 Likes

Thank you! This was a really good explanation and very straightforward. I feel at peace now that I know how these equations are arrived at.

3 Likes

This was super helpful, so thanks for putting the video and slides together.

I presume it is no accident that the derivatives of the logistic cost and of the squared-error cost used in linear regression end up being exactly the same? Is that just really good luck (for implementations), or were the cost functions chosen so that their derivatives end up being the same?

One way to think about this is to keep clear in your mind that there are two functions involved here: the activation function at the output layer that actually generates the prediction and then the loss function that is the metric for how good that prediction is.

In the logistic regression case, the pairing is sigmoid (which is based on the exponential function) and cross entropy loss (which is based on the natural logarithm).

In the case of Linear Regression, the activation is typically the identity function or ReLU and the loss function is MSE (mean squared error).

So what you’re doing is computing the derivative of the composite function. The fact that you end up with a very nice, well-behaved gradient means that you have a good pairing between your activation function and your loss or “distance” function. In other words, it means you made a good choice of loss function based on the nature and meaning of your prediction values.
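For a concrete comparison (a quick sketch, assuming the usual \frac{1}{2} factor the course puts in the squared-error loss): with the identity activation f = wx + b and L = \frac{1}{2}(f - y)^2, the same chain rule gives

\displaystyle \frac{\partial{L}}{\partial{w}} = \frac{\partial{L}}{\partial{f}} \cdot \frac{\partial{f}}{\partial{w}} = (f - y) \cdot x

which matches the logistic case, so the “niceness” of the gradient really does come from pairing each activation with its natural loss.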

This is how I derived it. I don’t know where I went wrong (I get f and (1-f) in the denominator). Could you please help?

Thanks!


In the first term of the first line in your first photo, you have left out \frac{\partial{f}}{\partial{b}}. Note that it is not equal to 1, so you can’t omit it.

Check for similar omissions in the other terms.

Thank you @eddy! It was the last missing piece after I finished the course. I didn’t expect to find such a well-laid-out explanation on this forum. Highly appreciated!

Here, I’ve just used another way to calculate \frac{\partial{L}}{\partial{w}} and arrive at the same result through some transformations and direct differentiation:

I know this is not a practical way, but I just wanted to strengthen my understanding by working through a pure substitution method. Please correct me if I’m doing something wrong :sweat_smile:

\eqalign{ L &= -[y * ln(\frac{1}{1 + e^{-(wx+b)}}) + (1 - y) * ln(1 - \frac{1}{1 + e^{-(wx+b)}})] \\ &= -[y * ln(\frac{1}{1 + e^{-(wx+b)}}) + (1 - y) * ln(\frac{e^{-(wx+b)}}{1 + e^{-(wx+b)}})] \\ }

Now, I used the fact that ln(\frac{1}{1 + e^{-z}}) = -ln(1 + e^{-z}) and ln(\frac{e^{-z}}{1 + e^{-z}}) = -z - ln(1 + e^{-z}), where z = wx+b:

\eqalign{ L &= -[-y * ln(1 + e^{-(wx+b)}) + (1 - y) * (-(wx+b) - ln(1 + e^{-(wx+b)}))] \\ &= -[-y * ln(1 + e^{-(wx+b)}) - (1 - y) * ln(1 + e^{-(wx+b)}) - (1 - y) * (wx+b)] \\ &= ln(1 + e^{-(wx+b)}) + (1 - y) * wx + (1 - y) * b \\ }

Now, find \frac{\partial{L}}{\partial{w}}:

\eqalign{ \frac{\partial{L}}{\partial{w}} &= \frac{-x * e^{-(wx+b)}}{1 + e^{-(wx+b)}} + (1 - y) * x \\ &= \frac{-x * e^{-(wx+b)} + (1 - y) * x * (1 + e^{-(wx+b)})}{1 + e^{-(wx+b)}} \\ &= \frac{-x * e^{-(wx+b)} + (1 - y) * x + (1 - y) * x * e^{-(wx+b)}}{1 + e^{-(wx+b)}} \\ &= \frac{-xy * e^{-(wx+b)} + x - xy}{1 + e^{-(wx+b)}} \\ &= \frac{-xy * (1 + e^{-(wx+b)}) + x}{1 + e^{-(wx+b)}} \\ &= -xy + \frac{x}{1 + e^{-(wx+b)}} \\ &= (\frac{1}{1 + e^{-(wx+b)}} - y) * x \\ &= (sigmoid(z) - y) * x = (f - y) * x }

My confusion was: how come the derivative of the cost function is the same for linear and logistic regression when the loss functions are different? The derivative of the cost function for linear regression was derived from MSE. I stumbled upon this thread while searching for comments, and the video and this note were really helpful. This set of videos on YouTube is truly amazing!!

Does it mean that the derivative of the cost function can be generalised to 1/m * sum((f(x) - y) * x), as depicted in the attached image?
[attached screenshot of the expression]
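Written out in the course’s notation (assuming m training examples, with x_j^{(i)} denoting feature j of example i), the expression I mean is

\displaystyle \frac{\partial{J}}{\partial{w_j}} = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x^{(i)}) - y^{(i)}\right) x_j^{(i)}

i.e., the same form for both linear and logistic regression, with only the definition of f_{w,b} changing.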

Dear Eddy,

thanks for all the material. I just noticed, while reviewing my course notes, that I had no idea why the derivative of the logistic cost function is the same as that of the linear cost function, even though the two cost functions look so very different.
I do understand now, and I was able to follow the explanation except for one step:
Why is \frac{d}{dz}(1 + e^{-z}) = -1e^{-z}?
I couldn’t figure out that one.

Best regards

Fabian

Well, the first step is simple, because it’s clear that:

\displaystyle \frac {d}{dz} (1 + e^{-z}) =\frac {d}{dz} e^{-z}

From there it is a straightforward application of the Chain Rule. We are composing two functions:

g(z) = -z
f(z) = e^z

So our function is:

h(z) = f(g(z))

Then the Chain Rule says that:

h'(z) = f'(g(z)) * g'(z)

If you work that out, you’ll see it gives exactly what Eddy shows there.
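Worked out explicitly:

\displaystyle h'(z) = f'(g(z)) \cdot g'(z) = e^{-z} \cdot (-1) = -1 \cdot e^{-z}

which is exactly the -1e^{-z} in the question.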

The Chain Rule is pretty fundamental and highly relevant to Neural Networks, since they are one huge function composition: the input of each function is the output of the previous function, all the way from the input data to the output cost value. The Chain Rule is covered in the first semester of any calculus course, right after the Sum Rule, the Product Rule, and the Exponent Rule.

Thank you Paul,

that was the missing piece for me.
I am familiar with the chain rule in principle, but I didn’t know how to deal with the z in the exponent.

Best regards

Fabian

1 Like

Hello,

I am wondering how the second summation disappears in the d/dw term of regularized logistic regression. Any help is greatly appreciated!

Have you taken multivariate calculus? We are taking the partial derivative there w.r.t. w_j, right? So that means that all the other w_i for i \neq j are constant w.r.t. w_j, so they disappear when you take the partial derivative. The derivative of a constant is 0.
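In symbols, with the regularization term written as in the course, \frac{\lambda}{2m} \sum_{i=1}^{n} w_i^2, only the i = j term survives the partial derivative:

\displaystyle \frac{\partial}{\partial{w_j}} \left( \frac{\lambda}{2m} \sum_{i=1}^{n} w_i^2 \right) = \frac{\lambda}{2m} \cdot 2 w_j = \frac{\lambda}{m} w_j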

1 Like

Got it! Thank you!

Another approach to calculating backpropagation is to use a circuit diagram. Read the notes from the Stanford course CS231n at CS231n Convolutional Neural Networks for Visual Recognition

1 Like