Logistic Regression fundamental question

I am working my way through Course 1 (Regression and Classification), as some of you may know, but I have a fundamental question about logistic regression.

Why does Andrew use z = w \cdot x + b in the first place when computing g(z)?

What is the motivation for using a linear regression prediction model as an argument to the function g(z)?

1 Like

Hi @ai_is_cool

The function g(z) (the sigmoid) in logistic regression is applied to a linear combination of the input features, z = w \cdot x + b, because this formulation provides a way to model probabilities while keeping the model interpretable.

The linear expression w \cdot x + b is a weighted sum of the input features (just as in linear regression), which captures the relationship between the input variables and the predicted outcome. However, since that linear expression can take any unbounded value, applying the sigmoid function constrains the output to lie between 0 and 1, which is what we need for binary classification.
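To make that concrete, here is a minimal NumPy sketch of the computation; the weights, bias, and input values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Made-up parameters and a single made-up input example
w = np.array([0.7, -1.2])
x = np.array([2.0, 0.5])
b = -0.3

z = np.dot(w, x) + b   # weighted sum: can be any real number
p = sigmoid(z)         # bounded in (0, 1): interpreted as P(y = 1 | x)
print(z, p)
```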

Hope it helps! Feel free to ask if you need further assistance.

1 Like

Thanks Alireza.

Can you explain more mathematically what you mean by…

“ because this formulation provides a way to model probabilities and also keep interpretability.”

I can see that the expression

f_{w, b}(x^{(i)}) = wx^{(i)} + b

maps tumor size to any positive or negative number, but is that number an estimate of the probability of the tumor being malignant under linear regression?

Thanks

The point, as I see it, is that the model is trained with target values of 0 and 1, where 1 means a certain outcome (say, “the tumor is malignant”) and 0 means the opposite. The logistic function maps the value of the linear combination of inputs to a number between 0 and 1, and the closer that number is to 1, the more likely the “yes” outcome. There is a quick worked example below. Hope this helps.
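As a quick worked example (the numbers are just for illustration): if the linear part gives z = 2, then g(2) = \frac{1}{1 + e^{-2}} \approx 0.88, which the model reads as roughly an 88% chance of the “yes” outcome, while z = -2 gives g(-2) \approx 0.12, i.e. the “no” outcome is much more likely.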

Thanks Luis,

Do you know how the initial values of w and b are chosen before the logistic regression algorithm produces its best estimate of whether a tumor is malignant or benign?

1 Like

The values of w and b are set during training using gradient descent. The training process starts with some small random values, and gradient descent changes them step by step, searching for the values that minimize the loss function on the training set. If you get a small loss on the training set, chances are that the final values of w and b will work fairly well when the model is applied to new cases (provided the model architecture suits the kind of problem you are dealing with, the training set is big and representative enough, etc.)
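In case it helps to see the shape of that training loop, here is a rough sketch (this is not the course's lab code; the toy dataset, learning rate, and iteration count are all assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up toy data: one feature (say, tumor size), labels 0/1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

m, n = X.shape
w = np.zeros(n)   # small/zero starting values
b = 0.0
alpha = 0.1       # learning rate (an assumption, not a course value)

for _ in range(1000):
    p = sigmoid(X @ w + b)       # current predicted probabilities
    dw = X.T @ (p - y) / m       # gradient of the log loss w.r.t. w
    db = np.sum(p - y) / m       # gradient w.r.t. b
    w -= alpha * dw              # one gradient-descent step
    b -= alpha * db

print(w, b, sigmoid(X @ w + b))
```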

1 Like

Typically what would you define as “…random small values…” for w and b?

The course deals with all that kind of stuff. You will find details in the coming lessons, so be patient! :slight_smile:

3 Likes

For Logistic Regression, it is not necessary to break symmetry by using random initialization. It can learn from initial values of all zeros or any other symmetric initialization. Here is a thread which discusses that and shows the math behind this point. What you will see on that thread is that symmetry breaking is required once we graduate to real neural networks, but LR is a special case.

Now, it may be possible that using random values could allow faster learning, but I did a few quick experiments a couple of years ago and it didn’t seem to make any significant difference in training performance (the CPU time and wall-clock time to reach a given level of convergence). But perhaps I didn’t use a sophisticated enough initialization algorithm. I have not taken MLS, so I don’t know what is discussed there, but initialization algorithms are covered in DLS Course 1 and Course 2 in some detail. So as Luis says, “stay tuned” for more information on that. If you want to try the comparison yourself, a rough sketch is below.
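Something along these lines is enough to check it on a toy problem (again, the data, learning rate, and iteration count are made-up assumptions, not anything from the course):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def final_log_loss(w, b, X, y, alpha=0.1, iters=1000):
    # Train with gradient descent, then report the log loss on the training set
    m = X.shape[0]
    for _ in range(iters):
        p = sigmoid(X @ w + b)
        w = w - alpha * (X.T @ (p - y)) / m
        b = b - alpha * np.sum(p - y) / m
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # same made-up toy data as above
y = np.array([0, 0, 1, 1])

loss_zeros = final_log_loss(np.zeros(1), 0.0, X, y)
loss_rand = final_log_loss(rng.normal(scale=0.01, size=1), 0.0, X, y)
print(loss_zeros, loss_rand)   # both converge; LR has no symmetry problem
```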

1 Like

That is the definition of the Logistic Regression algorithm. He could have used a polynomial function of higher degree, but using a linear function allows every feature to have a tunable effect on the output. There is also a nice geometric interpretation of LR: the decision boundary is a hyperplane in the input space, with all the “yes” answers on one side of the plane and all the “no” answers on the other. Of course, the decision boundary is:

g(z) = 0.5

and since g(z) is sigmoid, we know that means:

z = 0

or

\displaystyle \sum_{i = 1}^n w_i x_i + b = 0

If you think about what that means geometrically, w is the normal vector to the plane, and b sets the perpendicular distance from the origin to the decision boundary plane (that distance is |b| / \|w\|).
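A small sketch of that geometric picture, with made-up values of w and b: the sign of w \cdot x + b says which side of the boundary a point lies on, and dividing by \|w\| gives the signed perpendicular distance.

```python
import numpy as np

# Made-up 2-D parameters; the decision boundary is the line where w·x + b = 0
w = np.array([1.0, 2.0])   # normal vector to the boundary
b = -4.0

points = np.array([[0.0, 0.0], [1.0, 2.0], [4.0, 1.0]])   # made-up query points

z = points @ w + b                  # sign says which side of the boundary
label = (z > 0).astype(int)         # 1 where g(z) > 0.5, else 0
dist = z / np.linalg.norm(w)        # signed perpendicular distance to the boundary
print(label, dist)
```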

What do you mean by “…break symmetry…”?

Did you read the thread I linked?

Just to be consistent with the nomenclature in the course shouldn’t the last summation equation be written as:

\sum_{j = 1}^{n} w_j x_j^{(i)} + b = 0

As I mentioned, I’m not familiar with the material in MLS, but I am very familiar with the material in DLS. It also covers Logistic Regression in some detail. I used i instead of j as the index for the loop, but I could just as well have used fred or barney. In math, the symbols are just placeholders, right? They don’t really mean anything: the question is what operations we perform on them.

If you initialize all the weights to be zero, then they are symmetric, meaning they are all the same. If you initialize them to random values, then they are all different values. So they are no longer the same and no longer symmetric. Thus we say that symmetry is “broken”.

The thread I linked shows the math proving that LR can be trained with weights that are symmetric and, in particular, all zero.

We’re not using a linear regression prediction model. What you’re highlighting is an internal part of the logistic regression model.