Hi, I've got a question about is_logit. Prof. Ng said it sends z directly to the loss computation without computing the intermediate a, but then why is the output a linear activation? I'm confused by that.

Sorry, that should be from_logits.

By the way, I'd also like to know what a logit is, and the process involved as well as the output.

Thanks

A logit is the model's raw prediction. The final layer uses a linear activation, so it outputs the scalar z as-is, whereas, say, ReLU would clip all negative values to zero. Which activation is appropriate depends on your application, the desired predicted output, the cost function, etc.

In addition to @gent.spah's explanation, from another angle: we specify the `linear` activation because the `linear` activation does nothing. We want `a = z`, and `linear` gives us that.

Raymond

What I get is that instead of giving a as the output, it gives f(Z) directly, right? (Or Z? I think it's more likely to be Z as the output, since z is a linear combination of w, x and b, so the output becomes linear.) But how about the hidden layers instead of the output? Will they output Z too?

Hello @ricardowu1112, you did not define `f` in `f(Z)`.

The output of the neural network (or the output of the output layer of the neural network) is $\vec{a}^{[3]} = \vec{z}^{[3]} = W^{[3]}\vec{a}^{[2]} + \vec{b}^{[3]}$, where $\vec{a}^{[2]}$ is the output of the 2nd hidden layer. Note that it is $\vec{a}^{[2]}$, but **NOT** $x$. Since we are using a linear activation in the output layer, $\vec{a}^{[3]} = \vec{z}^{[3]}$.
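As a sketch of those formulas in NumPy (made-up layer sizes, and ReLU assumed for the hidden layers):

```python
import numpy as np

# Hypothetical sizes for a network with 2 hidden layers plus an output
# layer; the numbers are made up purely to illustrate the shapes.
rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                       # input features
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)
W3, b3 = rng.normal(size=(2, 3)), np.zeros(2)

relu = lambda z: np.maximum(z, 0)

a1 = relu(W1 @ x + b1)   # hidden layer 1: a1 = g(z1)
a2 = relu(W2 @ a1 + b2)  # hidden layer 2: a2 = g(z2)
z3 = W3 @ a2 + b3        # output layer pre-activation: the logits
a3 = z3                  # linear activation does nothing: a3 = z3
```

The only thing the linear activation "does" in the last line is leave $\vec{z}^{[3]}$ untouched.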

Raymond

Oh, so the hidden layers won't be affected? Can you help illustrate the process involved? Like $W^{[1]}x + b^{[1]} = a^{[1]}$ for the first layer, $W^{[2]}a^{[1]} + b^{[2]} = a^{[2]}$ for the second layer, and the output is directly $W^{[3]}a^{[2]} + b^{[3]}$ without $a^{[3]}$? I think an illustration of the process would help a lot, though it may be a bit of a bother.

Hi @ricardowu1112,

I can't agree with your formulas for the first two layers unless they use linear activations. Do they? Otherwise, can you add the activation functions back?

For the output layer, which is supposed to use a linear activation, I showed the formula in my previous reply. What is unclear about that formula?

Cheers,

Raymond

I have googled and found out that when `from_logits = True`, SCCE applies softmax to the final output, while if `from_logits = False`, we need to apply the softmax activation in the output layer manually, right? Can other losses use `from_logits` too? I guess it depends on what I want to get? Because I want a multi-class result, I use SCCE.

**Question 1: are you doing binary classification or multi-class?**

If binary, you need one neuron in the output layer and use `BinaryCrossentropy`.

If multi-class, you need as many neurons in the output layer as there are classes, and use `SparseCategoricalCrossentropy` or `CategoricalCrossentropy`.
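For example (a hypothetical 3-class batch; the two multi-class losses differ only in the label format they expect, not in the loss value):

```python
import tensorflow as tf

# Made-up logits for 2 examples over 3 classes
logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 0.2]])

scce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
cce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

int_labels = tf.constant([0, 1])             # SCCE: class indices
onehot_labels = tf.constant([[1., 0., 0.],
                             [0., 1., 0.]])  # CCE: one-hot rows

loss_sparse = scce(int_labels, logits)
loss_onehot = cce(onehot_labels, logits)
# Both compute the same cross entropy
```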

**Question 2: do you prefer to use `from_logits = True`?**

When you set `from_logits = True`, you are telling TensorFlow that your NN outputs logits, which means you will use `linear` as the output layer's activation.

When you set `from_logits = False`, you are telling TensorFlow that your NN outputs probabilities, which means you will use `sigmoid` (for binary) or `softmax` (for multi-class) as the output layer's activation.

Cheers,

Raymond

I see. If I use `from_logits = True` with SCCE as the loss function, will the final output still be a softmax output?

When you set `from_logits = True`, you are telling TensorFlow that your NN outputs logits, which means you will use `linear` as the output layer's activation.

This means your NN doesn't have a softmax output.

Why don't you try it out yourself?
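For instance, with a tiny made-up model you can verify that the network itself outputs logits, and that you apply `tf.nn.softmax` yourself when you want probabilities:

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy model with a linear output layer, the way it is set
# up when the loss uses from_logits=True.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(3, activation="linear"),  # logits, not probabilities
])

x = np.random.randn(2, 4).astype("float32")
logits = model(x).numpy()            # raw z values; rows need not sum to 1
probs = tf.nn.softmax(logits).numpy()  # converted manually; rows sum to 1
```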

Greetings!

I have a question regarding your reply Raymond:

> **Question 2: do you prefer to use `from_logits = True`?**
>
> When you set `from_logits = True`, you are telling TensorFlow that your NN outputs logits, which means you will use `linear` as the output layer's activation. When you set `from_logits = False`, you are telling TensorFlow that your NN outputs probabilities, which means you will use `sigmoid` (for binary) or `softmax` (for multi-class) as the output layer's activation.

Since that video lesson was about softmax, I also got confused when Mr. Ng changed the output layer's activation from softmax to linear.

Since the purpose of using **from_logits = True** was to make the computation more accurate, that can only be done when your output layer's activation is linear, right? And for cases where your output layer's activation is sigmoid or softmax, you just can't use **from_logits** anymore (which means you would have to use the code Mr. Ng said not to use in the "Neural Network with Softmax output" video). Is my conclusion correct?

Appreciate your help,

JC

Hello JC @Joan_Concha ,

Yes, we should only use one of the following four settings:

| problem | output layer activation | from_logits |
|---|---|---|
| binary | `linear` | `True` |
| binary | `sigmoid` | `False` |
| multi-class | `linear` | `True` |
| multi-class | `softmax` | `False` |
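A quick check of the two multi-class rows, with made-up logits, showing they compute the same loss:

```python
import tensorflow as tf

logits = tf.constant([[0.2, 1.5, -0.3]])  # hypothetical 3-class logits
y = tf.constant([1])                      # true class index

# linear output + from_logits=True
loss_a = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(y, logits)

# softmax output + from_logits=False
probs = tf.nn.softmax(logits)
loss_b = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)(y, probs)
```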

Cheers,

Raymond

Thanks!

@rmwkwok I don't understand something. By using `from_logits = True` we are merging the sigmoid activation's mathematical operation `1/(1+e^-z)` into the loss function equation, thus avoiding an intermediate step that may introduce round-off errors. So why doesn't the output layer output the probabilities, and why are we forced to do a `tf.nn.sigmoid(logit)`? My understanding was that the output layer activation was switched to linear to avoid applying the sigmoid again, but it had already been applied once in the loss function.

I understand that the output layer activation is linear, but if the loss function already included the sigmoid/softmax operation, why doesn't the output layer return probabilities?

@vmmf89, the program only goes through the loss function during training, not when making predictions.
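A small sketch of that point (hypothetical model and data): the loss, with its built-in sigmoid, is only used by `fit()`, while `predict()` just runs the forward pass and therefore returns logits:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(1, activation="linear"),  # outputs logits
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True))

x = np.ones((2, 3), dtype="float32")
raw = model.predict(x, verbose=0)    # logits; may lie outside [0, 1]
probs = tf.nn.sigmoid(raw).numpy()   # convert manually when needed
```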

Cheers,

Raymond

Hey @rmwkwok ,

This thread is really helpful. I just need a bit of clarification; I tried looking into the source code but couldn't find where exactly this happens. When `from_logits` is true, `p` isn't calculated, so the loss function now has to choose which function to apply to transform these logits into meaningful probabilities. How is this function chosen? Does it vary for each loss function? Does Keras provide a parameter where we can define which transformation function to use when `p` isn't already calculated?

Hello @eix_rap,

First, TensorFlow does not need to choose anything. On the contrary, we choose to use, for example, `tf.keras.losses.BinaryCrossentropy`, and once we have decided that `from_logits=True`, TensorFlow has no choice but to compute the binary cross entropy loss (as taught in the lecture).

**Would you mind telling me what gives you the impression that there is a choice? Moreover, what are those possible choices? I believe that if you share your view on these questions, it will help clarify things.**

On the other hand, `tf.keras.losses.BinaryCrossentropy` does NOT compute the probabilities. The probabilities are NEVER computed. The lecture explained that by showing the `from_logits=False` version and the `from_logits=True` version:

In the slide, `a` (which is equivalent to your `p`) is the probability and only exists in the `from_logits=False` version (also called the original loss in the slide). It ONLY exists in the original loss. ONLY. Therefore, in the `from_logits=True` version, TensorFlow never needs to compute the probabilities.

This is not about moving the computation of the probabilities from where we can see it to where we cannot see it. This is about NOT computing the probabilities at all.

The equation shown at the bottom of the slide might make you feel that we are still going to go through the process of computing the probabilities. We are NOT. That equation is only the first step of a series of simplifications that ultimately save us from having to compute the probabilities. It can be transformed into another equation that never computes the probabilities.
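As an illustration of that kind of simplification (a NumPy sketch, not TensorFlow's actual code), here is the binary case: the rearranged form is algebraically identical to the original loss but never computes the probability, so it stays finite even for extreme logits:

```python
import numpy as np

def bce_naive(y, z):
    # from_logits=False path written out: compute the probability first
    a = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def bce_from_logits(y, z):
    # Algebraically identical rearrangement that never forms the
    # probability a, similar in spirit to the form in the linked source:
    #   max(z, 0) - z*y + log(1 + exp(-|z|))
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z = np.array([0.5, -3.0, 40.0])  # a large logit breaks the naive version
y = np.array([1.0, 0.0, 0.0])

with np.errstate(divide="ignore"):   # naive path hits log(0) for z=40
    naive = bce_naive(y, z)
stable = bce_from_logits(y, z)
# naive[2] overflows to inf, while stable[2] is the correct finite loss
```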

Check this out https://github.com/tensorflow/tensorflow/blob/v2.13.0/tensorflow/python/ops/nn_impl.py#L113

It shows some (though not all) of the steps by which TensorFlow simplified the bottom equation in the slide. For the full derivation, I am afraid you will need to give it a try yourself, but please feel free to show me your attempt and I can take a look for you.

Cheers,

Raymond