Cost Function of Logistic Regression, Binary Classification

I have a question regarding the logistic regression cost function.

Should the x value from the training data set be placed in both logarithm terms, log(f(x)) and log(1-f(x)), of the cost function, or should it only be placed in one of them based on the actual output y of that x, similar to how we did it in linear regression?

The complete formula for the logistic regression cost is:
average over samples of -(y_target * log(f(x)) + (1 - y_target) * log(1 - f(x)))
where f(x) is y_predicted (the model's output for that x).

This formula already “chooses” which log function to use by making the results of one of those log functions 0. Therefore, you can place the x value in both the log functions.

The y_target values can only be 0 or 1.

If y_target = 0, then y_target * log(f(x)) would be 0, and so you would be left with (1-y_target) * log(1-f(x)) or just log(1-f(x)) for that sample.

If y_target = 1, then (1-y_target) * log(1-f(x)) would be 0, and so you would be left with y_target * log(f(x)) or just log(f(x)) for that sample.
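To make that selection concrete, here is a minimal Python sketch of the per-sample loss (the function and variable names are my own; the leading minus sign is the standard way of turning the log-likelihood into a cost to be minimized):

```python
import math

def sample_loss(y_target, f_x):
    # Per-sample logistic loss: -(y*log(f) + (1-y)*log(1-f)).
    # x enters through f_x in BOTH log terms; y_target decides
    # which term survives, since the other is multiplied by 0.
    return -(y_target * math.log(f_x) + (1 - y_target) * math.log(1 - f_x))

# y_target = 1: only -log(f(x)) contributes
print(sample_loss(1, 0.9))
# y_target = 0: only -log(1 - f(x)) contributes
print(sample_loss(0, 0.9))
```

Note that f_x appears in both log calls either way; the value of y_target alone determines which one affects the result.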


You answered me with something that Prof. Andrew told us, and that's already in my knowledge.

my question:
y_target * log(f(x)) + (1 - y_target) * log(1 - f(x))

suppose x = 2
then would it be
a) y_target * log(f(x=2)) + (1 - y_target) * log(1 - f(x=2)), with x = 2 placed in both log terms?
b) either log(f(x=2)) or log(1 - f(x=2)), depending on the exact output in between 0 and 1?

It’s not entirely clear to me what you’re asking.

If x=2 in that sample, then you can always use f(x=2) in both the log functions. The formula will “choose” which log function is used for the cost. Which log function is chosen depends on y_target, it does not depend on x or f(x).

Although they may look similar, y_target is a completely different value from y_prediction (or f(x)).

The y_target is provided in the data, and must be exactly 0 or exactly 1. For example, say you are using logistic regression to decide if an image contains a dog or does not contain a dog. In your data, you would have images and labels for “contains dog” or “no dog”. That label is y_target, and each image can either contain (y_target=1) or not contain (y_target=0) a dog. In this case, y_target cannot be 0.5, since it is a yes/no (binary) label.

The y_prediction (or f(x)), on the other hand, is computed/output by the model, and is different from y_target. It is possible for f(x) to be any real number between 0 and 1 (assuming you have a sigmoid activation function). For example, y_prediction, or f(x), can be 0.6. The y_prediction can be thought of as the probability that an image contains a dog.
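As an illustration of that distinction, here is a small sketch of how the model produces f(x). The weight, bias, and input values are made up purely for illustration; the point is that the sigmoid output always lands strictly between 0 and 1:

```python
import math

def sigmoid(z):
    # Squashes any real z into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters and input, just for illustration.
w, b = 1.5, -0.5
x = 2
f_x = sigmoid(w * x + b)

print(f_x)  # a probability-like value strictly between 0 and 1
```

Unlike y_target, which is read from the data and is exactly 0 or 1, f_x here is computed by the model and can be any value in between.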


I’ll do a little guessing here as to what your question is.

Yes, x is used in both places where f(x) is noted.

  • f(x) is the sigmoid of (x*w + b); its values will be between 0 and 1, but never exactly equal to either.

  • The y_target values are limited to 0 or 1 - there are no in-between values.

I don’t know exactly where this equation comes from, though.

hi Tom,
I will return to this discussion after some time…

Hi mentor
I understand this is a math question, yet it will help me understand why the definition of the logistic loss function works. Could you explain why the graph of -log(1-f) looks like this (when 0 < f < 1, we have 0 < -log(1-f) < infinity)?

Many thanks

We don’t use f values of exactly 0 and 1, since the log(f) and log(1-f) functions explode there.

Just use values that are extremely close to 0 and 1.

The graph looks like that because that’s just the way the formula and the log function work!

It sounds like you want to be convinced that for 0 < f < 1, we have 0 < -log(1-f) < infinity. You can actually plot this out by hand (I calculated the log numbers here using Google search).

f = 0.01, then -log(1-f) = 0.0043648054
f = 0.1, then -log(1-f) = 0.04575749056
f = 0.2, then -log(1-f) = 0.096910013
f = 0.3, then -log(1-f) = 0.15490195998
f = 0.4, then -log(1-f) = 0.22184874961
f = 0.5, then -log(1-f) = 0.30102999566
f = 0.6, then -log(1-f) = 0.39794000867
f = 0.7, then -log(1-f) = 0.52287874528
f = 0.8, then -log(1-f) = 0.69897000433
f = 0.9, then -log(1-f) = 1
f = 0.99, then -log(1-f) = 2
f = 0.999999, then -log(1-f) = 5.99999999999
f = 0.99999999999999, then -log(1-f) = 14.0003472608

If you plot out f and -log(1-f) by putting the dots on a grid and drawing a line through them (or use an Excel/Sheets program ;)), you will see this looks like the one in the lecture slide.
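If you'd rather not look each value up by hand, a short sketch that reproduces the table (note the values above were evidently computed with base-10 log, which is what Google's calculator uses; the natural log gives the same shape, just scaled):

```python
import math

# Recompute -log10(1 - f) for the sample points used above.
for f in [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99]:
    print(f, -math.log10(1 - f))
```

As f approaches 1, the loss grows without bound, which is exactly the shape shown in the lecture slide.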
