Confused about the gradient descent of the logistic log loss function

Let's keep the derivation part aside; it is too complicated for now.

Why is y subtracted? In the previous lecture (the simplified form), no matter which class we use, the y term is supposed to be multiplied by the ln part.

The cost function should be J = \frac{-1}{m} \sum_{i=1}^m \left( y_i \cdot \log(z_i) + (1 - y_i) \cdot \log(1 - z_i) \right) where z_i = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{x}_i + b)}}.

I’m not certain what you’re asking. At first I thought you were asking why we can combine the two loss functions into one line; then I saw your sigmoid question. So I’m going to answer both as I understand them.

Normally we’d use two different functions depending on whether y is 0 or y is 1. By combining both functions into one line we eliminate the need for an if check every time we calculate it.

If y = 0, then -y^{(i)} \cdot \log(f_{w,b}(x^{(i)})) evaluates to 0, because 0 times any value is 0.
If y = 1, then (1 - y^{(i)}) \cdot \log(1 - f_{w,b}(x^{(i)})) evaluates to 0, because 1 - 1 = 0.

This lets the appropriate loss term apply automatically, in a form computers can evaluate faster, though it is slightly harder for humans to read.
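A quick sketch of this point in plain Python (the function names here are mine, not from the course): the branched and the combined forms of the loss always agree.

```python
import math

def loss_branched(f, y):
    # Separate cases: -log(f) when y == 1, -log(1 - f) when y == 0.
    if y == 1:
        return -math.log(f)
    else:
        return -math.log(1 - f)

def loss_combined(f, y):
    # One-line form: the factor y (or 1 - y) zeroes out the inactive term.
    return -y * math.log(f) - (1 - y) * math.log(1 - f)

# Both forms agree for either label and any prediction in (0, 1).
for f in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert abs(loss_branched(f, y) - loss_combined(f, y)) < 1e-12
```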

For the sigmoid portion, you’re correct. J represents the cost function as a whole, and the loss function measures the error on a single example. We run the model’s linear output through the sigmoid function to limit its range, so in the end we get predictions between 0 and 1. It’s written the way Andrew has presented it because we’re taking different modular pieces and putting them together in a specific way. We could in theory swap in a different loss function and get different results, but the rest of the function remains the same.

Andrew will quite often start a subject using material we should already understand and modify it in small parts, adding layers of complexity until, at the end of the week, we have a complete picture of how we can optimize and perfect the functions.


Hi @tbhaxor

In linear regression the cost function is J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2.
We don’t need a log here because this cost function is convex: gradient descent converges after each iteration without falling into a local minimum, and even if it did fall into one, that local minimum’s cost would be very close to the global minimum cost, as the attached plots of the global and local minima show.
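As a small illustration of that squared-error cost (the helper name is hypothetical, not course code): for one feature, the cost is zero exactly when the line fits the data perfectly.

```python
def linear_cost(w, b, xs, ys):
    # J(w, b) = (1 / 2m) * sum over i of (w * x_i + b - y_i)^2, 1-D inputs
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Data generated by y = 2x + 1: the cost at (w=2, b=1) is exactly 0.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]
assert linear_cost(2.0, 1.0, xs, ys) == 0.0
```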

But in logistic regression that cost function isn’t convex, which means a local minimum could easily be found before reaching the global minimum (and that local minimum’s cost can be far from the global minimum cost), as in the attached plot. In order to ensure the cost function is convex (and therefore ensure convergence to the global minimum), the cost function is built from the logarithm of the sigmoid function, which removes the ripples and makes it smooth so it converges, as in the attached plot.
We use this loss function: loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \tag{2}


Yes, I know this. What I am saying is that the differentiation (gradient) of the logistic log function should not be the one used in the second function. This is because

f(w,b,x) = \frac{1}{1+e^{-(w*x+b)}}

The derivative of the log loss (let's take the first term, for a positive true label) with respect to b will be

\frac{\partial \ln(f(w,b,x))}{\partial b} = \frac{1}{e^{wx+b} + 1}, which is inconsistent with the second screenshot I shared.
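A quick numerical sanity check of that closed form (all names here are mine): a centered finite difference of \ln f with respect to b matches \frac{1}{e^{wx+b} + 1}.

```python
import math

def f(w, b, x):
    # Sigmoid of the linear model: 1 / (1 + e^{-(wx + b)})
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

w, b, x, eps = 0.7, -0.3, 1.5, 1e-6

# Centered finite difference of ln(f) with respect to b
numeric = (math.log(f(w, b + eps, x)) - math.log(f(w, b - eps, x))) / (2 * eps)

# Closed form claimed above: 1 / (e^{wx + b} + 1)
closed = 1.0 / (math.exp(w * x + b) + 1.0)

assert abs(numeric - closed) < 1e-6
```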

@tbhaxor It is a partial derivative with respect to whichever weight you are differentiating by, so the partial derivative of the log via the chain rule isn’t \frac{1}{x\ln(a)}; there is no reason to avoid using \ln.

Combining both equations, we get a convex log loss function, as shown below:

Combined Cost Function

In order to optimize this convex function, we can go with either gradient descent or Newton's method. In both cases, we need to derive the gradient of this loss function. The mathematics for deriving the gradient is shown in the steps below.

The Derivative of Cost Function:

Since the hypothesis function for logistic regression is a sigmoid, the first important step is finding the gradient of the sigmoid function. We can see from the derivation below that the gradient of the sigmoid function follows a certain pattern.

Hypothesis Function

Derivative of Sigmoid Function
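The pattern in question is \sigma'(z) = \sigma(z)(1 - \sigma(z)). A small finite-difference check (a sketch of mine, not course code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

eps = 1e-6
for z in (-2.0, 0.0, 1.5):
    # Centered finite difference approximates the true derivative
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    # The "certain pattern": sigma'(z) = sigma(z) * (1 - sigma(z))
    pattern = sigmoid(z) * (1.0 - sigmoid(z))
    assert abs(numeric - pattern) < 1e-6
```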

Step 1:

Applying Chain rule and writing in terms of partial derivatives.

Step 2:

Evaluating the partial derivative using the pattern of the derivative of the sigmoid function.

Step 3:

Simplifying the terms by multiplication

Step 4:

Removing the summation term by converting it into a matrix form for the gradient with respect to all the weights including the bias term.
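A minimal sketch of that final matrix form in plain Python (function names are mine; in practice you would use a linear-algebra library): the gradient is \frac{1}{m} X^T(\sigma(X\theta) - y), where a leading column of 1s in X makes \theta_0 act as the bias term b.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(X, y, theta):
    # grad_j = (1/m) * sum over i of (sigmoid(x_i . theta) - y_i) * x_ij,
    # i.e. the matrix form (1/m) * X^T (sigmoid(X theta) - y)
    m, n = len(X), len(theta)
    errors = [sigmoid(sum(X[i][j] * theta[j] for j in range(n))) - y[i]
              for i in range(m)]
    return [sum(errors[i] * X[i][j] for i in range(m)) / m for j in range(n)]

# Column of 1s prepended so theta[0] plays the role of the bias b.
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]
y = [1, 0, 1]
grad = gradient(X, y, [0.0, 0.0])
```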

I think you are missing the point here: the partial derivative is taken of the cost function, not the sigmoid function, so why did you omit the ln here?

Taking the partial derivative of that expression means you are not taking the logs, as you can see in the first screenshot.

Hi @tbhaxor

I know the partial derivative is of the cost function, but here I take the derivative of the sigmoid function so that I can substitute it for h(x) in the partial derivative of the cost function.

We can differentiate the following three equations separately.

This is called the chain rule. When you simplify these equations, you get the following results.
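Spelled out for a single example (writing h for the sigmoid output, z = \theta^T x for the linear part, and \mathcal{L} for the per-example loss), the chain rule here multiplies three factors:

```latex
\frac{\partial \mathcal{L}}{\partial \theta_j}
  = \frac{\partial \mathcal{L}}{\partial h}
    \cdot \frac{\partial h}{\partial z}
    \cdot \frac{\partial z}{\partial \theta_j},
\qquad
\frac{\partial \mathcal{L}}{\partial h} = -\frac{y}{h} + \frac{1-y}{1-h},
\quad
\frac{\partial h}{\partial z} = h(1-h),
\quad
\frac{\partial z}{\partial \theta_j} = x_j
```

Multiplying the three factors, the h(1-h) cancels the denominators and the product simplifies to (h - y)\,x_j.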

Please explain what you said, and some of my doubt will be cleared. Yes, I have found it can be solved by the chain rule.

Without using the chain rule:
This is the original equation of the cost function,

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h(x^{(i)})) + (1 - y^{(i)}) \log(1 - h(x^{(i)})) \right] where h(x) = \frac{1}{1 + e^{-\theta^T x}}.

We can simplify \log(h(x)) to be

\log(h(x)) = -\log(1 + e^{-\theta^T x})

After that we also simplify \log(1 - h(x)), according to \log(h(X)) - \log(h(y)) = \log(\frac{h(X)}{h(y)}), to be

\log(1 - h(x)) = \log\left(\frac{e^{-\theta^T x}}{1 + e^{-\theta^T x}}\right) = -\theta^T x - \log(1 + e^{-\theta^T x})

Plugging in the two simplified expressions above and collecting the \log(1 + e^{-\theta^T x}) terms, according to \log(h(X)) + \log(h(y)) = \log({h(X)} \cdot {h(y)}), we obtain

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ (1 - y^{(i)})\, \theta^T x^{(i)} + \log(1 + e^{-\theta^T x^{(i)}}) \right] \quad (*)

All you need now is to compute the partial derivatives of (*) with respect to \theta_j, which gives

\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
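To double-check that final formula numerically (all names below are mine, standard library only): a finite difference of the cost matches \frac{1}{m}\sum (h - y)\,x_j at every component.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum of [y log(h) + (1 - y) log(1 - h)]
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * x for t, x in zip(theta, xi)))
        total += -yi * math.log(h) - (1 - yi) * math.log(1 - h)
    return total / m

def analytic_grad(theta, X, y, j):
    # (1/m) * sum of (h(x_i) - y_i) * x_ij, the formula derived above
    m = len(X)
    return sum((sigmoid(sum(t * x for t, x in zip(theta, xi))) - yi) * xi[j]
               for xi, yi in zip(X, y)) / m

X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]  # leading 1s = bias column
y = [1, 0, 1]
theta, eps = [0.2, -0.1], 1e-6
for j in range(2):
    tp = theta[:]; tp[j] += eps
    tm = theta[:]; tm[j] -= eps
    numeric = (cost(tp, X, y) - cost(tm, X, y)) / (2 * eps)
    assert abs(numeric - analytic_grad(theta, X, y, j)) < 1e-6
```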


Also, if you are still confused about anything, feel free to ask.