Here are some of the slides from the videos (you’ll hear more commentary in the videos).
You can get the derivative of the loss with respect to parameter “w” by calculating three separate derivatives and multiplying them together. This is the “chain rule” in calculus and it’s a useful concept that shows up elsewhere (like in neural networks, which you’ll learn about in course 2).
To get the derivative of the cost with respect to parameter “b”, you can actually just calculate the first derivative, and reuse the second and third derivatives from before.
One thing worth noting is that being able to re-use some of the derivative calculations elsewhere is very helpful. You’ll see more of this when you learn about neural networks in the second course.
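For anyone following along, here is my own summary of the three factors being multiplied, using the notation from the slides (f for the sigmoid output, z = \vec{w} \cdot \vec{x} + b, and L the loss for a single example); please check it against the videos:

\frac{\partial{L}}{\partial{w_j}} = \frac{\partial{L}}{\partial{f}} \cdot \frac{\partial{f}}{\partial{z}} \cdot \frac{\partial{z}}{\partial{w_j}} = \left(-\frac{y}{f} + \frac{1-y}{1-f}\right) \cdot f(1-f) \cdot x_j = (f - y)\,x_j

\frac{\partial{L}}{\partial{b}} = \frac{\partial{L}}{\partial{f}} \cdot \frac{\partial{f}}{\partial{z}} \cdot \frac{\partial{z}}{\partial{b}} = \left(-\frac{y}{f} + \frac{1-y}{1-f}\right) \cdot f(1-f) \cdot 1 = f - y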
I was also able to run your “ML_Specialization_Course_1_Derivatives_of_the_Logistic_Loss_Function.ipynb” notebook in Jupyter Notebook 6.5.2 with Python 3.7.7 after downloading the copy from my Google Drive.
The comparison of Andrew’s equation (e.g. dL_dw=(f-y)*xj) and your chain rule equation (e.g. dL_dw=dL_df * df_dz * dz_dw) is very good too!
This was super helpful, so thanks for putting the video and slides together.
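To make that comparison concrete, here is a minimal numerical sanity check I put together on top of it (the numbers are made up and this is just my sketch, not code from the notebook):

```python
import numpy as np

# One made-up training example, just to compare the two forms of the gradient.
x = np.array([0.5, -1.2, 3.0])   # features
y = 1.0                          # label
w = np.array([0.1, 0.2, -0.3])   # weights
b = 0.05                         # bias

z = np.dot(w, x) + b
f = 1.0 / (1.0 + np.exp(-z))     # sigmoid prediction

# Andrew's compact form: dL_dw = (f - y) * xj
dL_dw_compact = (f - y) * x

# Chain-rule form: dL_dw = dL_df * df_dz * dz_dw
dL_df = -y / f + (1.0 - y) / (1.0 - f)
df_dz = f * (1.0 - f)
dz_dw = x
dL_dw_chain = dL_df * df_dz * dz_dw

print(np.allclose(dL_dw_compact, dL_dw_chain))  # True
```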
I presume it is no accident that the derivatives for both logistic cost and squared error cost used in linear regression end up being the exact same? Is that just really good luck (for implementations), or were the cost functions chosen so that their derivatives end up being the same?
One way to think about this is to keep clear in your mind that there are two functions involved here: the activation function at the output layer that actually generates the prediction and then the loss function that is the metric for how good that prediction is.
In the logistic regression case, the pairing is sigmoid (which is based on the exponential function) and cross entropy loss (which is based on the natural logarithm).
In the case of Linear Regression, the activation is typically the identity function or ReLU and the loss function is MSE (mean squared error).
So what you’re doing is computing the derivative of the composite function. The fact that you end up with a very nice, well-behaved gradient means that you have a good pairing between your activation function and your loss or “distance” function. In other words, you made a good choice of loss function based on the nature and meaning of your prediction values.
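To make that concrete: the logistic pairing is worked out earlier in the thread, and for linear regression with the identity activation and squared error loss the same collapse happens (my notation, single example):

L = \frac{1}{2}(f - y)^2, \quad f = z = \vec{w} \cdot \vec{x} + b \quad\Rightarrow\quad \frac{\partial{L}}{\partial{w_j}} = (f - y) \cdot 1 \cdot x_j = (f - y)\,x_j

which is exactly the same expression you get from the sigmoid + cross entropy pairing.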
In the first term of the first line in your first photo, you have left out \frac{\partial{f}}{\partial{b}}. Note that it is not equal to 1, so you can’t omit it.
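For reference, since f depends on b only through z:

\frac{\partial{f}}{\partial{b}} = \frac{\partial{f}}{\partial{z}} \cdot \frac{\partial{z}}{\partial{b}} = f(1-f) \cdot 1 = f(1-f)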
Thank you @eddy! It was the last missing piece after I finished the course. I didn’t expect to find such a well laid-out explanation on this forum. Highly appreciated!
Here, I’ve just used another way to calculate \frac{\partial{L}}{\partial{w}}, arriving at the same result via some transformations and direct differentiation:
I know this is not a practical approach, but I just want to strengthen my understanding by working through a pure substitution method. Please correct me if I did something wrong.
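Writing it out in text as well (please double-check my algebra): substitute f = \frac{1}{1+e^{-z}} into L first, simplify, and only then differentiate.

L = -y\log(f) - (1-y)\log(1-f) = y\log(1+e^{-z}) + (1-y)\left(z + \log(1+e^{-z})\right) = \log(1+e^{-z}) + (1-y)\,z

\frac{\partial{L}}{\partial{z}} = \frac{-e^{-z}}{1+e^{-z}} + (1-y) = -(1-f) + (1-y) = f - y

\frac{\partial{L}}{\partial{w_j}} = \frac{\partial{L}}{\partial{z}} \cdot \frac{\partial{z}}{\partial{w_j}} = (f - y)\,x_j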
My confusion was: how can the derivative of the cost function be the same for linear and logistic regression when their loss functions are different? The derivative of the cost function for linear regression was derived from MSE. I stumbled upon this article while searching for comments on this, and the video and this note were really helpful. This set of videos on YouTube is truly amazing!!
Does that mean the derivative of the cost function can be generalised to \frac{1}{m}\sum_{i=1}^{m}\left(f(x^{(i)}) - y^{(i)}\right)x_j^{(i)}, i.e. something like what is depicted in the attached image?
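For context, here is a rough sketch of what I mean in code (my own code, not from the course labs); only the prediction f changes between the two models, while the gradient expression itself is identical:

```python
import numpy as np

# Sketch of the generalised gradient 1/m * sum((f - y) * x) over a data set.
# The function name and the `model` switch are just for illustration.
def gradient(X, y, w, b, model="logistic"):
    z = X @ w + b                                   # linear scores, shape (m,)
    f = 1.0 / (1.0 + np.exp(-z)) if model == "logistic" else z
    m = X.shape[0]
    dJ_dw = (X.T @ (f - y)) / m                     # 1/m * sum((f - y) * x_j)
    dJ_db = np.sum(f - y) / m                       # 1/m * sum(f - y)
    return dJ_dw, dJ_db
```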
Thanks for all the material. I just noticed, while reviewing my course notes, that I had no idea why the derivative of the logistic cost function is the same as that of the linear regression cost function, although the two cost functions look so very different.
I do understand now and I was able to follow the explanation except for one step:
Why is \frac{d}{dz}\left(1 + e^{-z}\right) = -1 \cdot e^{-z}?
I couldn’t figure out that one.
From there it is a straightforward application of the Chain Rule. We are composing two functions:
g(z) = -z
f(z) = e^z
So our function is:
h(z) = f(g(z))
Then the Chain Rule says that:
h'(z) = f'(g(z)) * g'(z)
If you work that out, you’ll see it gives exactly what Eddy shows there.
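Spelling that out (it should match what Eddy shows):

h(z) = e^{-z} = f(g(z)), \quad f(z) = e^{z}, \quad g(z) = -z

h'(z) = f'(g(z)) \cdot g'(z) = e^{-z} \cdot (-1) = -1 \cdot e^{-z}

so \frac{d}{dz}\left(1 + e^{-z}\right) = 0 + (-1) \cdot e^{-z} = -e^{-z}.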
The Chain Rule is pretty fundamental and highly relevant to Neural Networks, since they are one huge function composition: the input of each function is the output of the previous function, all the way from the input data to the output cost value. The Chain Rule is covered in the first semester of any calculus course, right after the Sum Rule, the Product Rule, and the Exponent Rule.
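As a schematic (my own shorthand, with just three layers for illustration):

\hat{y} = f_3(f_2(f_1(x))) \quad\Rightarrow\quad \frac{d\hat{y}}{dx} = f_3'(f_2(f_1(x))) \cdot f_2'(f_1(x)) \cdot f_1'(x)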
Have you taken multivariate calculus? We are taking the partial derivative there w.r.t. w_j, right? So that means that all the other w_i for i \neq j are constant w.r.t. w_j, so they disappear when you take the partial derivative w.r.t. w_j. The derivative of a constant is 0.
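Concretely, writing z out as a sum over the features:

z = \sum_{i=1}^{n} w_i x_i + b \quad\Rightarrow\quad \frac{\partial{z}}{\partial{w_j}} = x_j

since every term with i \neq j, as well as b, is a constant with respect to w_j.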