In C1W3 we learn that the squared error cost is not a suitable cost function for logistic regression, because gradient descent can end up in a local minimum instead of the global one. The instructor presents the odd squiggly cost curve that results when squared error is combined with the sigmoid function.
For the cost function, we are presented with the loss function of:
(-y * np.log(sigmoid(predictions))) - ((1 - y) * np.log(1 - sigmoid(predictions)))
This should make the cost function convex and allow gradient descent to find the global minimum.
However, later, when describing how to compute the gradient for gradient descent, we do not use this cost function but basically just the sum of the residuals (sigmoid output minus actual value).
What is the point of the loss function then?
Take the example provided in the optional lab and the quiz. There the gradient is calculated without using this loss function:
iterate over rows:
    iterate over column:
        z = sum slopes and features
    add y_intercept (b)
    f(x) = sigmoid(z)
It just happens that the gradient of the logistic regression model with log loss (the loss you wrote in your post) and the gradient of linear regression with squared loss look exactly the same. If we differentiate the cost functions for linear regression and logistic regression, we arrive at the same form in both cases; the only difference is that f(x) is the sigmoid of the linear model in the logistic case.
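One way to convince yourself of this is to compare the "residual" formula against a finite-difference gradient of the log loss. The following is only a minimal sketch on random toy data, not the assignment's code; the function names log_loss_cost, analytic_gradient and numerical_gradient are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_cost(X, y, w, b):
    # Mean log loss over all examples (hypothetical helper, vectorized).
    f = sigmoid(X @ w + b)
    return np.mean(-y * np.log(f) - (1 - y) * np.log(1 - f))

def analytic_gradient(X, y, w, b):
    # The "sum of residuals" form: (sigmoid output - actual) times the feature.
    err = sigmoid(X @ w + b) - y
    return X.T @ err / X.shape[0], np.mean(err)

def numerical_gradient(X, y, w, b, eps=1e-6):
    # Central finite differences applied directly to the log loss cost.
    dj_dw = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        dj_dw[j] = (log_loss_cost(X, y, w + step, b)
                    - log_loss_cost(X, y, w - step, b)) / (2 * eps)
    dj_db = (log_loss_cost(X, y, w, b + eps)
             - log_loss_cost(X, y, w, b - eps)) / (2 * eps)
    return dj_dw, dj_db

# Random toy data, assumed purely for this check.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) > 0.5).astype(float)
w = rng.normal(size=3)
b = 0.5

dw_a, db_a = analytic_gradient(X, y, w, b)
dw_n, db_n = numerical_gradient(X, y, w, b)
print(np.allclose(dw_a, dw_n, atol=1e-6), np.isclose(db_a, db_n, atol=1e-6))
```

If the two gradients agree, the residual formula really is the derivative of the log loss, even though the loss itself never appears in it.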
Hm. I’m just trying to understand when to use the presented loss function then.
Here we implement the loss function; we are told this will make the cost function convex and allow us to find the global minimum.
However in Exercise 3:
We don’t use this loss function; we sum up the results of the sigmoid function instead. I assume there is some mathematical equivalence between the two, but I’m trying to understand why we don’t use the loss function when computing the gradient for gradient descent, since the presenter stated that the cost function as defined in Exercise 3 would result in a non-convex function.
From the theoretical point of view, the gradient formulae (in Ex 3) are derived from the cost function (in Ex 2), so the former depend on the latter. We differentiate the cost function to get the gradient formulae. The differentiation steps are not shown in the assignment.
The differentiation only needs to be done once, with paper and pencil, to go from the cost function to the gradient formulae. Once we have the gradient formulae, we just implement their final form, which is why you will not see the relation between the two in the Python code.
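For reference, the paper-and-pencil steps might look roughly like the following. This is a sketch using my own notation, not taken from the assignment: f^{(i)} = \sigma(z^{(i)}) with z^{(i)} = w \cdot x^{(i)} + b, and m is the number of examples.

```latex
J(w,b) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log f^{(i)} + \left(1-y^{(i)}\right)\log\left(1-f^{(i)}\right) \right]

\frac{\partial f^{(i)}}{\partial w_j} = \sigma'\!\left(z^{(i)}\right) x_j^{(i)} = f^{(i)}\left(1-f^{(i)}\right) x_j^{(i)}

\frac{\partial J}{\partial w_j}
  = -\frac{1}{m}\sum_{i=1}^{m}\left[ \frac{y^{(i)}}{f^{(i)}} - \frac{1-y^{(i)}}{1-f^{(i)}} \right] f^{(i)}\left(1-f^{(i)}\right) x_j^{(i)}
  = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\left(1-f^{(i)}\right) - \left(1-y^{(i)}\right) f^{(i)} \right] x_j^{(i)}
  = \frac{1}{m}\sum_{i=1}^{m}\left( f^{(i)} - y^{(i)} \right) x_j^{(i)}

\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left( f^{(i)} - y^{(i)} \right)
```

The logarithms cancel against the derivative of the sigmoid, which is why only the residual f^{(i)} - y^{(i)} survives in the final form.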
From this assignment's point of view, you only need compute_gradient to do gradient descent. The purpose of compute_cost is to record the cost at each step of gradient descent, so that at the end we can (although we didn’t) plot how the cost drops over the gradient descent steps.
The cost plot is useful for visually inspecting whether the cost has converged (a sign that training is complete), and it is also useful for detecting overfitting if we plot the cost curves of both the training set and the cv set. The use of such plots for identifying overfitting will be covered in Course 2 Week 3.
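To make the division of labour concrete, here is a minimal sketch of how the two functions might be used together. The bodies of compute_cost and compute_gradient below are my own vectorized stand-ins, not the assignment's solutions, and the data, learning rate and iteration count are assumed toy values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(X, y, w, b):
    # Log loss averaged over the examples; used only for monitoring.
    f = sigmoid(X @ w + b)
    return np.mean(-y * np.log(f) - (1 - y) * np.log(1 - f))

def compute_gradient(X, y, w, b):
    # Final form of the derivative of the cost; this is what drives the updates.
    err = sigmoid(X @ w + b) - y
    return X.T @ err / X.shape[0], np.mean(err)

def gradient_descent(X, y, w, b, alpha, num_iters):
    cost_history = []
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)  # needed for the update
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
        cost_history.append(compute_cost(X, y, w, b))  # recorded, never used in the update
    return w, b, cost_history

# Small toy dataset, assumed for illustration.
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5],
              [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

w, b, cost_history = gradient_descent(X, y, np.zeros(2), 0.0, alpha=0.1, num_iters=1000)
print(cost_history[0], cost_history[-1])
```

Removing the compute_cost line would not change the learned w and b at all; cost_history is only there so the curve can be inspected or plotted afterwards.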
@rmwkwok Ah, OK, understood. That makes more sense now. Thank you for the explanation.