Calculation of partial derivative of the cost function for logistic regression

Has anyone actually done the maths to calculate the partial derivative of the cost function J(\vec w, b) with respect to w_j?

I have gone through my calculations twice, and I get the same result as Andrew in his video “Gradient Descent Implementation”, but with a minus sign in front of the result.

I have taken screenshots of my calculations and pasted them here in case you need to check them.

It’s been many years since I last calculated partial derivatives, so I may have made an error somewhere.

Hopefully someone with more recent experience can spot the error.

I see my mistake now. I had written down f(\vec w, b) incorrectly before differentiating it. I had…

f(\vec w, b) = \frac{1}{1 - e^{-z}}

instead of…

f(\vec w, b) =\frac{1}{1 + e^{-z}}

I’ll keep the post here for anyone who is interested in how Andrew arrives at the expression for the partial derivative.
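For anyone who doesn’t want to work through the screenshots, here is a sketch of how the derivative falls out, assuming the cross-entropy loss from the lectures and natural logarithms, with z = \vec w \cdot \vec x^{(i)} + b and f = f_{\vec w, b}(\vec x^{(i)}) = \frac{1}{1 + e^{-z}}:

\frac{\partial f}{\partial z} = f(1 - f)

\frac{\partial L}{\partial f} = -\frac{y^{(i)}}{f} + \frac{1 - y^{(i)}}{1 - f}

\frac{\partial L}{\partial z} = \frac{\partial L}{\partial f}\cdot\frac{\partial f}{\partial z} = -y^{(i)}(1 - f) + (1 - y^{(i)})f = f - y^{(i)}

\frac{\partial z}{\partial w_j} = x_j^{(i)}

so that

\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec w, b}(\vec x^{(i)}) - y^{(i)}\right)x_j^{(i)}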


That’s nice homework to do to understand the algorithm properly. Thank you for sharing it.


Isn’t it amazing how the gradient descent algorithm is identical to that for linear regression except for f(\vec w, b)?
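To make the point concrete, here is a minimal NumPy sketch (my own illustration, not code from the course labs; the names gradients, linear_f and logistic_f are made up): the gradient computation is exactly the same, and only the prediction function passed in changes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(X, y, w, b, predict):
    """Gradient of the cost w.r.t. w and b.

    Identical for linear and logistic regression; only the
    prediction function `predict` differs.
    """
    m = X.shape[0]
    err = predict(X, w, b) - y      # the "error" term, f(x) - y
    dj_dw = X.T @ err / m           # (1/m) * sum_i (f - y) * x_j
    dj_db = err.mean()              # (1/m) * sum_i (f - y)
    return dj_dw, dj_db

# Linear regression prediction vs. logistic regression prediction.
linear_f   = lambda X, w, b: X @ w + b
logistic_f = lambda X, w, b: sigmoid(X @ w + b)

# Toy data, purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
w = np.zeros(3)
b = 0.0

print(gradients(X, y, w, b, linear_f))
print(gradients(X, y, w, b, logistic_f))
```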


Yes, it is a remarkable coincidence.


It seems like too much of a coincidence, given that the two forms of f(\vec w, b) are so different: one is linear and the other non-linear.


Apparently the non-linear log function in the cost equation is counteracted by the non-linear exponential in the sigmoid function that is part of f(\vec w, b).
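A sketch of the cancellation, per training example and assuming natural logarithms. For linear regression with squared error and f = \vec w \cdot \vec x^{(i)} + b:

\frac{\partial}{\partial w_j}\frac{1}{2}\left(f - y^{(i)}\right)^2 = \left(f - y^{(i)}\right)x_j^{(i)}

For logistic regression with the log loss and f = \frac{1}{1 + e^{-z}}:

\frac{\partial L}{\partial w_j} = \left(-\frac{y^{(i)}}{f} + \frac{1 - y^{(i)}}{1 - f}\right)f(1 - f)\,x_j^{(i)} = \left(f - y^{(i)}\right)x_j^{(i)}

The \frac{1}{f} and \frac{1}{1 - f} coming from the log exactly cancel the f(1 - f) coming from the sigmoid, which is why the two gradients end up with the same form.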


And it is also reasonable for the loss function’s first derivative to be proportional to the error, so that it tends to zero when the error tends to zero.

Can you present this result mathematically, say using…

L(f(\vec w, b), y^{(i)}) = -log(f(\vec w, b))

for y^{(i)} = 1

I have also noticed that Andrew is missing a constant factor of…

\frac{1}{ln(10)}

in his final result for gradient descent of logistic regression.
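To spell out where that factor would come from: if the logs in the loss really were base 10, then differentiating would introduce a constant, since

\frac{d}{dx}\log_{10}(x) = \frac{1}{x\,\ln(10)}

and the per-example gradient would become \frac{1}{\ln(10)}\left(f - y^{(i)}\right)x_j^{(i)} rather than \left(f - y^{(i)}\right)x_j^{(i)}.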

That was just an intuition, not the result of a mathematical derivation.

If the first derivative of the loss did not tend to zero when the error tends to zero, then gradient descent would keep pushing the weights away, which is not a favourable consequence.

By error, I meant the \left(f(\vec w, b) - y^{(i)}\right) part of your work. It was called the error because it’s the difference between the truth and the prediction.

Doesn’t this argument sound reasonable to you? I am a Physics graduate, and we always like to discuss intuitive understandings of maths formulas, though it’s not always easy to do.

Cheers,
Raymond

P.S. You have presented mathematically the result of the derivative of the loss being proportional to the error :wink:
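For reference, for y^{(i)} = 1 that sketch looks like this (natural logarithms assumed):

L = -\log(f), \qquad \frac{\partial L}{\partial f} = -\frac{1}{f}, \qquad \frac{\partial f}{\partial w_j} = f(1 - f)\,x_j^{(i)}

\frac{\partial L}{\partial w_j} = -\frac{1}{f}\cdot f(1 - f)\,x_j^{(i)} = (f - 1)\,x_j^{(i)} = \left(f - y^{(i)}\right)x_j^{(i)}

which is proportional to the error and tends to zero as the error does.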

The slides might not have specified the base of the log (I haven’t checked all of them), but if we work backward from the final result of gradient descent for logistic regression, the base was e.

The first derivative of…

L(f(\vec w, b), y^{(i)}) = -log(\frac{1}{1 + e^{-z}})

is not zero when the error…

f(\vec w, b)

equals zero if you think about how…

-log(x)

approaches \infty as x \to 0, passes through zero at x = 1, and continues to take negative values as x \to \infty.

The first derivative never becomes zero.

I was talking about the first derivative of the loss with respect to weight.
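In symbols (a sketch for y^{(i)} = 1, natural logarithms), the two derivatives being discussed are different quantities:

\frac{\partial L}{\partial f} = -\frac{1}{f}

is never zero for finite f, but

\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial f}\cdot\frac{\partial f}{\partial z}\cdot\frac{\partial z}{\partial w_j} = -\frac{1}{f}\cdot f(1 - f)\cdot x_j^{(i)} = (f - 1)\,x_j^{(i)}

which does tend to zero as f \to 1, i.e. as the error tends to zero.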

Mathematics nomenclature uses log(x), which is what Andrew writes, to mean base 10, and ln(x) to mean base e.

But this is a Machine Learning class. I don’t know who can say what the Machine Learning nomenclature is regarding the use of log, but this Machine Learning Specialization uses base e.

The first derivative never reaches zero with respect to the error or w_j if you think about the “shape” of -log(x) against x.

Agreed! I should change it from “equal zero” to “tends to zero”.


I have changed it to “tends to zero” in my previous post.

Not “Machine Learning nomenclature”, mathematical nomenclature.

Shouldn’t Andrew be using ln(x) instead of log(x) if base-e logarithms are being used?

He could, but I am not sure that he should.

Btw, what about the intuition? Does it make sense to you that as the error tends to zero, the derivative should too?