Here are some of the slides from the videos (you’ll hear more commentary in the videos).
You can get the derivative of the loss with respect to parameter “w” by calculating three separate derivatives and multiplying them together. This is the “chain rule” in calculus and it’s a useful concept that shows up elsewhere (like in neural networks, which you’ll learn about in course 2).
To get the derivative of the cost with respect to parameter “b”, you can actually just calculate the first derivative, and reuse the second and third derivatives from before.
One thing worth noting is that being able to re-use some of the derivative calculations elsewhere is very helpful. You’ll see more of this when you learn about neural networks in the second course.
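For anyone following along, here is my own summary of the three factors being multiplied, using the notation from the slides (f for the sigmoid output, z = \vec{w} \cdot \vec{x} + b, and L the loss for a single example); please check it against the videos:

\frac{\partial{L}}{\partial{w_j}} = \frac{\partial{L}}{\partial{f}} \cdot \frac{\partial{f}}{\partial{z}} \cdot \frac{\partial{z}}{\partial{w_j}} = \left(-\frac{y}{f} + \frac{1-y}{1-f}\right) \cdot f(1-f) \cdot x_j = (f - y)\,x_j

\frac{\partial{L}}{\partial{b}} = \frac{\partial{L}}{\partial{f}} \cdot \frac{\partial{f}}{\partial{z}} \cdot \frac{\partial{z}}{\partial{b}} = \left(-\frac{y}{f} + \frac{1-y}{1-f}\right) \cdot f(1-f) \cdot 1 = f - y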
I was also able to run your “ML_Specialization_Course_1_Derivatives_of_the_Logistic_Loss_Function.ipynb” notebook in Jupyter Notebook 6.5.2 with Python 3.7.7 after downloading the copy from my Google Drive.
The comparison of Andrew’s equation (e.g. dL_dw=(f-y)*xj) and your chain rule equation (e.g. dL_dw=dL_df * df_dz * dz_dw) is very good too!
This was super helpful, so thanks for putting the video and slides together.
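To make that comparison concrete, here is a minimal numerical sanity check I put together on top of it (the numbers are made up and this is just my sketch, not code from the notebook):

```python
import numpy as np

# One made-up training example, just to compare the two forms of the gradient.
x = np.array([0.5, -1.2, 3.0])   # features
y = 1.0                          # label
w = np.array([0.1, 0.2, -0.3])   # weights
b = 0.05                         # bias

z = np.dot(w, x) + b
f = 1.0 / (1.0 + np.exp(-z))     # sigmoid prediction

# Andrew's compact form: dL_dw = (f - y) * xj
dL_dw_compact = (f - y) * x

# Chain-rule form: dL_dw = dL_df * df_dz * dz_dw
dL_df = -y / f + (1.0 - y) / (1.0 - f)
df_dz = f * (1.0 - f)
dz_dw = x
dL_dw_chain = dL_df * df_dz * dz_dw

print(np.allclose(dL_dw_compact, dL_dw_chain))  # True
```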
I presume it is no accident that the derivatives for both logistic cost and squared error cost used in linear regression end up being the exact same? Is that just really good luck (for implementations), or were the cost functions chosen so that their derivatives end up being the same?
One way to think about this is to keep clear in your mind that there are two functions involved here: the activation function at the output layer that actually generates the prediction and then the loss function that is the metric for how good that prediction is.
In the logistic regression case, the pairing is sigmoid (which is based on the exponential function) and cross entropy loss (which is based on the natural logarithm).
In the case of Linear Regression, the activation is typically the identity function or ReLU and the loss function is MSE (mean squared error).
So what you’re doing is computing the derivative of the composite function. The fact that you end up with a very nice, well-behaved gradient means that you have a good pairing between your activation function and your loss or “distance” function. In other words, you made a good choice of loss function based on the nature and meaning of your prediction values.
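To make that concrete: the logistic pairing is worked out earlier in the thread, and for linear regression with the identity activation and squared error loss the same collapse happens (my notation, single example):

L = \frac{1}{2}(f - y)^2, \quad f = z = \vec{w} \cdot \vec{x} + b \quad\Rightarrow\quad \frac{\partial{L}}{\partial{w_j}} = (f - y) \cdot 1 \cdot x_j = (f - y)\,x_j

which is exactly the same expression you get from the sigmoid + cross entropy pairing.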
In the first term of the first line in your first photo, you have left out \frac{\partial{f}}{\partial{b}}. Note that it is not equal to 1, so you can’t omit it.
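For reference, since f depends on b only through z:

\frac{\partial{f}}{\partial{b}} = \frac{\partial{f}}{\partial{z}} \cdot \frac{\partial{z}}{\partial{b}} = f(1-f) \cdot 1 = f(1-f)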
Thank you @eddy! It was the last missing piece after I finished the course. I didn’t expect to find such a well laid-out explanation on this forum. Highly appreciated!
Here, I’ve just used another way to calculate \frac{\partial{L}}{\partial{w}}, arriving at the same result via some transformations and direct differentiation:
I know this is not a practical approach, but I just want to strengthen my understanding by working through a pure substitution method. Please correct me if I did something wrong.
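Writing it out in text as well (please double-check my algebra): substitute f = \frac{1}{1+e^{-z}} into L first, simplify, and only then differentiate.

L = -y\log(f) - (1-y)\log(1-f) = y\log(1+e^{-z}) + (1-y)\left(z + \log(1+e^{-z})\right) = \log(1+e^{-z}) + (1-y)\,z

\frac{\partial{L}}{\partial{z}} = \frac{-e^{-z}}{1+e^{-z}} + (1-y) = -(1-f) + (1-y) = f - y

\frac{\partial{L}}{\partial{w_j}} = \frac{\partial{L}}{\partial{z}} \cdot \frac{\partial{z}}{\partial{w_j}} = (f - y)\,x_j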
My confusion was: how can the derivative of the cost function be the same for linear and logistic regression when their loss functions are different? The derivative of the cost function for linear regression was derived from MSE. I stumbled upon this article while searching for comments on this, and the video and this note were really helpful. This set of videos on YouTube is truly amazing!!
Does that mean the derivative of the cost function can be generalised to \frac{1}{m}\sum_{i=1}^{m}\left(f(x^{(i)}) - y^{(i)}\right)x_j^{(i)}, i.e. something like what is depicted in the attached image?
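For context, here is a rough sketch of what I mean in code (my own code, not from the course labs); only the prediction f changes between the two models, while the gradient expression itself is identical:

```python
import numpy as np

# Sketch of the generalised gradient 1/m * sum((f - y) * x) over a data set.
# The function name and the `model` switch are just for illustration.
def gradient(X, y, w, b, model="logistic"):
    z = X @ w + b                                   # linear scores, shape (m,)
    f = 1.0 / (1.0 + np.exp(-z)) if model == "logistic" else z
    m = X.shape[0]
    dJ_dw = (X.T @ (f - y)) / m                     # 1/m * sum((f - y) * x_j)
    dJ_db = np.sum(f - y) / m                       # 1/m * sum(f - y)
    return dJ_dw, dJ_db
```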
Thanks for all the material. I just noticed, while reviewing my course notes, that I had no idea why the derivative of the logistic cost function is the same as that of the linear regression cost function, although the two cost functions look so very different.
I do understand now and I was able to follow the explanation except for one step:
Why is \frac{d}{dz}\left(1 + e^{-z}\right) = -1 \cdot e^{-z}?
I couldn’t figure out that one.
From there it is a straightforward application of the Chain Rule. We are composing two functions:
g(z) = -z
f(z) = e^z
So our function is:
h(z) = f(g(z))
Then the Chain Rule says that:
h'(z) = f'(g(z)) * g'(z)
If you work that out, you’ll see it gives exactly what Eddy shows there.
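Spelling that out (it should match what Eddy shows):

h(z) = e^{-z} = f(g(z)), \quad f(z) = e^{z}, \quad g(z) = -z

h'(z) = f'(g(z)) \cdot g'(z) = e^{-z} \cdot (-1) = -1 \cdot e^{-z}

so \frac{d}{dz}\left(1 + e^{-z}\right) = 0 + (-1) \cdot e^{-z} = -e^{-z}.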
The Chain Rule is pretty fundamental and highly relevant to Neural Networks, since they are one huge function composition: the input of each function is the output of the previous function, all the way from the input data to the output cost value. The Chain Rule is covered in the first semester of any calculus course, right after the Sum Rule, the Product Rule, and the Exponent Rule.
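As a schematic (my own shorthand, with just three layers for illustration):

\hat{y} = f_3(f_2(f_1(x))) \quad\Rightarrow\quad \frac{d\hat{y}}{dx} = f_3'(f_2(f_1(x))) \cdot f_2'(f_1(x)) \cdot f_1'(x)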
Have you taken multivariate calculus? We are taking the partial derivative there w.r.t. w_j, right? So that means that all the other w_i for i \neq j are constant w.r.t. w_j, so they disappear when you take the partial derivative w.r.t. w_j. The derivative of a constant is 0.
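Concretely, writing z out as a sum over the features:

z = \sum_{i=1}^{n} w_i x_i + b \quad\Rightarrow\quad \frac{\partial{z}}{\partial{w_j}} = x_j

since every term with i \neq j, as well as b, is a constant with respect to w_j.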