Confused about the gradient descent of the logistic log loss function

Let's keep the derivation part aside; it is too complicated for now.

Why is y subtracted? In the previous lecture (the simplified form), no matter which class we use, the y term is supposed to be multiplied by the ln part.

The cost function should be J = \frac{-1}{m} \sum_{i=1}^m \left( y_i \cdot \log(z_i) + (1 - y_i) \cdot \log(1 - z_i) \right) where z_i = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{x}_i + b)}}.

I’m not certain what you’re asking. At first I thought you were asking why we can combine the two loss functions into one line; then I saw your sigmoid question. So I’m going to answer both as I understand them.

Normally we’d use two different functions depending on whether y is 0 or y is 1. By combining both functions into one line we eliminate the need for an if check every time we calculate it.

If y = 0, then -y^{(i)} \cdot \log(f_{w,b}(x^{(i)})) evaluates to 0, because 0 times any value is 0.
If y = 1, then (1 - y^{(i)}) \cdot \log(1 - f_{w,b}(x^{(i)})) evaluates to 0, because 1 - 1 = 0.

This lets the appropriate loss term apply automatically, in a form computers can evaluate faster, though it is slightly harder for humans to read.
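A quick sketch of this point in plain Python (the function names here are mine, not from the course): the branched and the combined forms of the loss always agree.

```python
import math

def loss_branched(f, y):
    # Separate cases: -log(f) when y == 1, -log(1 - f) when y == 0.
    if y == 1:
        return -math.log(f)
    else:
        return -math.log(1 - f)

def loss_combined(f, y):
    # One-line form: the factor y (or 1 - y) zeroes out the inactive term.
    return -y * math.log(f) - (1 - y) * math.log(1 - f)

# Both forms agree for either label and any prediction in (0, 1).
for f in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert abs(loss_branched(f, y) - loss_combined(f, y)) < 1e-12
```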

For the sigmoid portion, you’re correct. J represents the cost function as a whole, and the loss function measures the error on a single example. We run the model’s linear output through the sigmoid function to limit its range, so in the end we get predictions between 0 and 1. It’s written the way Andrew has presented it because we’re taking different modular pieces and putting them together in a specific way. We could in theory swap in a different loss function and get different results, but the rest of the function remains the same.

Andrew will quite often start a subject using material we should already understand and modify it in small parts, adding layers of complexity until, at the end of the week, we have a complete picture of how we can optimize and perfect the functions.


Hi @tbhaxor

In linear regression the cost function is J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2.
We don’t need a log here because this cost function is convex: gradient descent converges after each iteration without falling into a local minimum, and even if it did fall into one, that local minimum’s cost would be very close to the global minimum cost, as the attached plots of the global and local minima show.
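As a small illustration of that squared-error cost (the helper name is hypothetical, not course code): for one feature, the cost is zero exactly when the line fits the data perfectly.

```python
def linear_cost(w, b, xs, ys):
    # J(w, b) = (1 / 2m) * sum over i of (w * x_i + b - y_i)^2, 1-D inputs
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Data generated by y = 2x + 1: the cost at (w=2, b=1) is exactly 0.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]
assert linear_cost(2.0, 1.0, xs, ys) == 0.0
```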

But in logistic regression that cost function isn’t convex, which means a local minimum could easily be found before reaching the global minimum (and that local minimum’s cost can be far from the global minimum cost), as in the attached plot. In order to ensure the cost function is convex (and therefore ensure convergence to the global minimum), the cost function is built from the logarithm of the sigmoid function, which removes the ripples and makes it smooth so it converges, as in the attached plot.
We use this loss function: loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \tag{2}


Yes, I know this. What I am saying is that the differentiation (gradient) of the logistic log function should not be the one used in the second function. This is because

f(w,b,x) = \frac{1}{1+e^{-(w*x+b)}}

The derivative of the log loss (let's take the first term, for a positive true label) with respect to b will be

\frac{\partial \ln(f(w,b,x))}{\partial b} = \frac{1}{e^{wx+b} + 1}, which is inconsistent with the second screenshot I shared.
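A quick numerical sanity check of that closed form (all names here are mine): a centered finite difference of \ln f with respect to b matches \frac{1}{e^{wx+b} + 1}.

```python
import math

def f(w, b, x):
    # Sigmoid of the linear model: 1 / (1 + e^{-(wx + b)})
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

w, b, x, eps = 0.7, -0.3, 1.5, 1e-6

# Centered finite difference of ln(f) with respect to b
numeric = (math.log(f(w, b + eps, x)) - math.log(f(w, b - eps, x))) / (2 * eps)

# Closed form claimed above: 1 / (e^{wx + b} + 1)
closed = 1.0 / (math.exp(w * x + b) + 1.0)

assert abs(numeric - closed) < 1e-6
```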

@tbhaxor It is a partial derivative with respect to whichever weight you are differentiating by, so the partial derivative of the log via the chain rule isn’t \frac{1}{x\ln(a)}; there is no reason to avoid using \ln.

Combining both equations, we get a convex log loss function, as shown below:

Combined Cost Function

In order to optimize this convex function, we can go with either gradient descent or Newton's method. In both cases, we need to derive the gradient of this loss function. The mathematics for deriving the gradient is shown in the steps below.

The Derivative of Cost Function:

Since the hypothesis function for logistic regression is a sigmoid, the first important step is finding the gradient of the sigmoid function. We can see from the derivation below that the gradient of the sigmoid function follows a certain pattern.

Hypothesis Function

Derivative of Sigmoid Function
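The pattern in question is \sigma'(z) = \sigma(z)(1 - \sigma(z)). A small finite-difference check (a sketch of mine, not course code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

eps = 1e-6
for z in (-2.0, 0.0, 1.5):
    # Centered finite difference approximates the true derivative
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    # The "certain pattern": sigma'(z) = sigma(z) * (1 - sigma(z))
    pattern = sigmoid(z) * (1.0 - sigmoid(z))
    assert abs(numeric - pattern) < 1e-6
```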

Step 1:

Applying Chain rule and writing in terms of partial derivatives.

Step 2:

Evaluating the partial derivative using the pattern of the derivative of the sigmoid function.

Step 3:

Simplifying the terms by multiplication

Step 4:

Removing the summation term by converting it into a matrix form for the gradient with respect to all the weights including the bias term.
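A minimal sketch of that final matrix form in plain Python (function names are mine; in practice you would use a linear-algebra library): the gradient is \frac{1}{m} X^T(\sigma(X\theta) - y), where a leading column of 1s in X makes \theta_0 act as the bias term b.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(X, y, theta):
    # grad_j = (1/m) * sum over i of (sigmoid(x_i . theta) - y_i) * x_ij,
    # i.e. the matrix form (1/m) * X^T (sigmoid(X theta) - y)
    m, n = len(X), len(theta)
    errors = [sigmoid(sum(X[i][j] * theta[j] for j in range(n))) - y[i]
              for i in range(m)]
    return [sum(errors[i] * X[i][j] for i in range(m)) / m for j in range(n)]

# Column of 1s prepended so theta[0] plays the role of the bias b.
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]
y = [1, 0, 1]
grad = gradient(X, y, [0.0, 0.0])
```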

I think you are missing the point here: the partial derivative is taken of the cost function, not the sigmoid function, so why did you omit the ln here?

Taking the partial derivative of that expression means you are not taking the logs, as you can see in the first screenshot.

Hi @tbhaxor

I know the partial derivative is of the cost function, but here I take the derivative of the sigmoid function so that I can substitute it for h(x) in the partial derivative of the cost function.

We can differentiate the following three equations separately.

This is called the chain rule. When you simplify these equations, you get the following results.
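Spelled out for a single example (writing h for the sigmoid output, z = \theta^T x for the linear part, and \mathcal{L} for the per-example loss), the chain rule here multiplies three factors:

```latex
\frac{\partial \mathcal{L}}{\partial \theta_j}
  = \frac{\partial \mathcal{L}}{\partial h}
    \cdot \frac{\partial h}{\partial z}
    \cdot \frac{\partial z}{\partial \theta_j},
\qquad
\frac{\partial \mathcal{L}}{\partial h} = -\frac{y}{h} + \frac{1-y}{1-h},
\quad
\frac{\partial h}{\partial z} = h(1-h),
\quad
\frac{\partial z}{\partial \theta_j} = x_j
```

Multiplying the three factors, the h(1-h) cancels the denominators and the product simplifies to (h - y)\,x_j.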

Please explain what you said, and some of my doubt will be cleared. Yes, I have found it can be solved by the chain rule.

Without using the chain rule:
This is the original equation of the cost function,

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h(x^{(i)})) + (1 - y^{(i)}) \log(1 - h(x^{(i)})) \right] where h(x) = \frac{1}{1 + e^{-\theta^T x}}.

We can simplify \log(h(x)) to be

\log(h(x)) = -\log(1 + e^{-\theta^T x})

After that we also simplify \log(1 - h(x)), according to \log(h(X)) - \log(h(y)) = \log(\frac{h(X)}{h(y)}), to be

\log(1 - h(x)) = \log\left(\frac{e^{-\theta^T x}}{1 + e^{-\theta^T x}}\right) = -\theta^T x - \log(1 + e^{-\theta^T x})

Plugging in the two simplified expressions above and collecting the \log(1 + e^{-\theta^T x}) terms, according to \log(h(X)) + \log(h(y)) = \log({h(X)} \cdot {h(y)}), we obtain

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ (1 - y^{(i)})\, \theta^T x^{(i)} + \log(1 + e^{-\theta^T x^{(i)}}) \right] \quad (*)

All you need now is to compute the partial derivatives of (*) with respect to \theta_j, which gives

\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
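To double-check that final formula numerically (all names below are mine, standard library only): a finite difference of the cost matches \frac{1}{m}\sum (h - y)\,x_j at every component.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum of [y log(h) + (1 - y) log(1 - h)]
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * x for t, x in zip(theta, xi)))
        total += -yi * math.log(h) - (1 - yi) * math.log(1 - h)
    return total / m

def analytic_grad(theta, X, y, j):
    # (1/m) * sum of (h(x_i) - y_i) * x_ij, the formula derived above
    m = len(X)
    return sum((sigmoid(sum(t * x for t, x in zip(theta, xi))) - yi) * xi[j]
               for xi, yi in zip(X, y)) / m

X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]  # leading 1s = bias column
y = [1, 0, 1]
theta, eps = [0.2, -0.1], 1e-6
for j in range(2):
    tp = theta[:]; tp[j] += eps
    tm = theta[:]; tm[j] -= eps
    numeric = (cost(tp, X, y) - cost(tm, X, y)) / (2 * eps)
    assert abs(numeric - analytic_grad(theta, X, y, j)) < 1e-6
```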


Also, if you are still confused about anything, feel free to ask.