Logistic Regression using the sigmoid function

Weights are updated by using the gradients of the logistic cost function.

It’s the same sort of process used for linear regression, except the logistic cost function is different.

What are the mathematical expressions for updating the w and b values in logistic regression?

What is the mathematical expression for the logistic regression cost function?

See the Week 3 lectures.

Logistic cost function:
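In the notation of the course, where $f_{w,b}(x) = g(w \cdot x + b)$ and $g$ is the sigmoid:

$$J(w,b) = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log\!\left(f_{w,b}\!\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\!\left(1-f_{w,b}\!\left(x^{(i)}\right)\right)\right]$$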

Gradient descent for logistic regression:
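$$\text{repeat until convergence:}\qquad w_j := w_j - \alpha\,\frac{\partial J(w,b)}{\partial w_j}\,,\qquad b := b - \alpha\,\frac{\partial J(w,b)}{\partial b}$$

where the partial derivatives work out to

$$\frac{\partial J(w,b)}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}\!\left(x^{(i)}\right)-y^{(i)}\right)x_j^{(i)}\,,\qquad \frac{\partial J(w,b)}{\partial b} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}\!\left(x^{(i)}\right)-y^{(i)}\right)$$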


There are also optional labs to let you see exactly how the cost function and gradient descent are implemented.

Here’s one screenshot from the Implementing Gradient Descent video that shows the cost function and how it is used in gradient descent. (You can see how that ties into the screenshot I shared earlier.) Definitely worth watching the videos and looking at the labs to get more background.
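If it helps to see the same idea in code, here is a minimal NumPy sketch of how the cost and the gradient updates fit together (the function and variable names are my own, not necessarily the lab’s):

```python
import numpy as np

def sigmoid(z):
    # Element-wise sigmoid: works for a scalar z or a whole vector of z values
    return 1 / (1 + np.exp(-z))

def compute_cost(X, y, w, b):
    # Logistic cost over all m examples (X has shape (m, n), y has shape (m,))
    f = sigmoid(X @ w + b)          # predictions, one per example
    return -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))

def gradient_descent_step(X, y, w, b, alpha):
    # One simultaneous update of w and b using the gradients of the logistic cost
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y    # (f - y) for each example
    dj_dw = X.T @ err / m           # one partial derivative per feature
    dj_db = np.mean(err)
    return w - alpha * dj_dw, b - alpha * dj_db
```

Calling gradient_descent_step in a loop and printing compute_cost each time should show the cost steadily decreasing, which is the behaviour the video walks through.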


Hi Wendy,

Thanks for that.

I didn’t realise there was a different cost function in logistic regression.

I can see I will encounter that as I progress through Week 3 of the course.

But I am still a little confused over the equations you included in your last message.

It looks like the partial derivatives from linear regression are still used to determine w and b, while there is no expression given for the partial derivative of the logistic regression cost function, which includes logarithms.

Stephen.

Hello Wendy,

I have just looked at the Optional Lab again for the introduction to logistic regression, and in the first graphic with the title “Sigmoid or Logistic Function” I noticed that it says:

“In the case of logistic regression, z (the input to the sigmoid function), is the output of a linear regression model.”

and

“in the case of multiple examples, 𝑧 may be a vector consisting of 𝑚 values, one for each example.”

But clearly for multiple rows of input feature values, the value of z will still be a scalar, since we are computing a vector dot product between the vectors w and x.

So I am confused now.

Stephen.

This holds true because, before applying the sigmoid function to get a probability value, the model first calculates a linear combination of the features, which is represented by “z”, and then feeds that value into the sigmoid function to produce the final probability prediction.

In logistic regression, if you notice the equation:

Let’s say we have two given vectors:

A = a1 * i + a2 * j + a3 * k
B = b1 * i + b2 * j + b3 * k, where i, j and k are the unit vectors along the x, y and z directions.

Then the dot product is calculated as:
dot product = a1 * b1 + a2 * b2 + a3 * b3

For example:
A = 3 * i + 5 * j + 4 * k
B = 2 * i + 7 * j + 5 * k
dot product = 3 * 2 + 5 * 7 + 4 * 5
= 6 + 35 + 20
= 61
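
The same arithmetic in NumPy, if you want to verify it:

```python
import numpy as np

A = np.array([3, 5, 4])
B = np.array([2, 7, 5])
print(np.dot(A, B))  # 3*2 + 5*7 + 4*5 = 61
```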

Here A and B are input features, and a1, a2, a3 are multiple rows of A input features; in the same way, for the B input features, b1, b2, b3 are multiple rows of B input features.

If we have A, B, C examples, then z gives a vector output: (A, its value), (B, its value) and (C, its value).

This would calculate the multiple features across the number of observations using linear regression, giving a vector of each feature and its value; to this the sigmoid function is applied.

I am sharing a pdf of a machine learning model relating to the classification of tumors, which explains this part.

Kalaiyarasi_2020_IOP_Conf._Ser.__Mater._Sci._Eng._995_012028.pdf (1.2 MB)

Let me know if you are still confused.

Hello Deepti,

I don’t quite understand your explanation.

You say “…a1, a2, a3 are multiple rows of A input features…”, so are you assuming that there is only one weight parameter w here?

No, the weight as well as the bias is also added, but it depends on the threshold of tumor size used to determine whether it is benign or malignant. Say the threshold is given as 0.5; anything beyond 0.5 is considered malignant.

So your z would be the output of these threshold parameters, with a comparison of the input feature to the multiple values, giving you another z value for each feature.

Then the sigmoid activation function is applied to this z value, giving you the output of the logistic regression model.

Did you notice the picture @Wendy shared of linear and logistic regression?

Did you go through the pdf I shared?

I’m still confused over your first reply.

Can you first confirm whether you went through the pdf?

It explains that when there are multiple rows in relation to the features, the matrix output for each observation (xi) against the jth feature gives you a vector value.

Did you check that?

I don’t think it is necessary to consult documentation outside Andrew’s course.

I just need an explanation using the content of the course as that is what I am not understanding.

If you cannot look beyond your objectives of understanding, then it is difficult for anyone to explain it to you, because even Professor Ng provides references to many articles in his courses.

Hope you find your explanation.

Good luck!

It’s not that “…I cannot look beyond my objectives of understanding…”, I am only interested in looking within my objectives of understanding.

Stephen, getting back to your questions from the lab:

This is just talking about when you run the model to calculate a result given a set of input values. The way the model works, it will first calculate w · x + b for the current w & b, and then pass that result to the sigmoid function to get the final prediction.
It’s still the case that when you are training the model using gradient descent as we discussed, you will want to take the sigmoid function into account when iteratively solving for w & b, using the functions we discussed earlier.
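
To your earlier point about the logarithms: when you take the partial derivatives of the logistic cost (the one with the logs), the log terms and the sigmoid’s derivative cancel out, and the gradients end up with exactly the same form as for linear regression:

$$\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}\!\left(x^{(i)}\right)-y^{(i)}\right)x_j^{(i)}$$

The difference is hidden inside $f_{w,b}$: for linear regression $f_{w,b}(x) = w \cdot x + b$, while for logistic regression $f_{w,b}(x) = g(w \cdot x + b)$ with $g$ the sigmoid.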

It will all start making sense as you continue with the videos.

Remember that when you are evaluating w . x + b for one example, you are using the dot product for w . x because there could be multiple features in x. As you say, this will result in a scalar. The lab is just reminding you that if you have multiple examples, you will get a scalar for each - resulting in a vector of scalars for z: one for each example.
You saw a similar situation in the previous week’s course about linear regression.

If this part still doesn’t make sense, I think it will click when you actually try it in future labs. It’s always helpful to look at real results to see what’s happening - maybe add a few print statements to look at results or try a simple case to help you see what’s going on.
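
For example, here’s a tiny case you can paste in and print (the numbers are made up):

```python
import numpy as np

def sigmoid(z):
    # Applied element-wise when z is an array
    return 1 / (1 + np.exp(-z))

# m = 3 examples (rows), n = 2 features (columns)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -0.25])
b = 1.0

z_one = np.dot(w, X[0]) + b   # one example: a scalar
z_all = X @ w + b             # all m examples: a vector of m scalars

print(z_one)           # 1.0
print(z_all)           # [1.  1.5 2. ]
print(sigmoid(z_all))  # approx [0.731 0.818 0.881] -- one probability per example
```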


Hi Wendy,

It is still not clear to me what is meant by “…multiple examples…”.

Is an “example” a value for x that hasn’t been used in training the linear regression model and for which we want to know what the model predicts for this value of x that the model hasn’t “seen before”?

Stephen.

An “example” is a term that is commonly used to refer to one x (set of features) and its corresponding y (target value). When you’ve seen a plot of the data points in the videos, each of those points corresponds to an example.

It’s common (esp. in this course) to use the variable m to refer to the total number of training examples.
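
If it helps to see it concretely (the numbers here are made up):

```python
import numpy as np

# m = 3 training examples, each with n = 2 features and one target value
X_train = np.array([[1.0, 2.0],    # features of example 1
                    [2.0, 1.5],    # features of example 2
                    [3.0, 0.5]])   # features of example 3
y_train = np.array([0, 0, 1])      # target y for each example

m, n = X_train.shape  # m = 3 examples, n = 2 features
```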


Ok, that makes sense.

But what is the mathematical expression for the evaluation of g(z) when z is a vector of scalars?

I don’t recall Andrew talking about the “vector product” in logistic regression being performed between two vectors.

In the MLS lectures and exercises, he doesn’t very often discuss training sets that have more than one feature.

It appears in the linear regression examples for the house price prediction, although there he may be treating them as two scalars in each example rather than as a two-element vector. Functionally they are equivalent.

In logistic regression I believe it’s in the materials for creating additional polynomial features.
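
As for the expression for g(z) when z is a vector of scalars: the sigmoid is simply applied element-wise, one value per example:

$$g(\mathbf{z}) = \begin{bmatrix} g\!\left(z^{(1)}\right) \\ \vdots \\ g\!\left(z^{(m)}\right) \end{bmatrix},\qquad g\!\left(z^{(i)}\right) = \frac{1}{1+e^{-z^{(i)}}}$$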