Hi, I've got a question about is_logit. Prof. Ng said it makes z get sent directly to the loss computation without computing the intermediate a, but then why is the output layer's activation linear? I am confused by that.

Sorry, that should be from_logits.
By the way, I would like to know what a logit is, as well as the process involved and what the output is.
Thanks

A logit is the model's raw prediction. It comes from a linear activation in the final layer, which outputs an unbounded number, whereas, say, ReLU clips all negative values to zero. Which activation you want depends on your application, the desired predicted output, the cost function, etc.

In addition to @gent.spah's explanation, from another angle, we want to specify linear activation because the linear activation does nothing. We want a = z and linear gives us that.

Raymond


What I get is that instead of giving a as the output, it gives f(Z) directly, right? (Or Z? I think it is more likely to be Z as the output, since z is a linear combination of w, x and b, so the output becomes linear.) But what about the hidden layers instead of the output layer? Will they be Z also?

Hello @ricardowu1112, you did not define f in f(Z).

The output of the neural network (or the output of the output layer of the neural network) is \vec{a}^{[3]} = \vec{z}^{[3]} = W^{[3]}\vec{a}^{[2]} + \vec{b}^{[3]} where \vec{a}^{[2]} is the output of the 2nd hidden layer. Note that it is \vec{a}^{[2]}, but NOT x. Since we are using linear activation in the output layer, \vec{a}^{[3]} = \vec{z}^{[3]}.
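That forward pass can be sketched in plain Python. The weights, biases, and layer sizes below are made up for illustration, and the ReLU activations in the two hidden layers are an assumption; the point is only that the output layer applies no activation, so a = z there:

```python
def relu(z):
    return [max(0.0, v) for v in z]

def dense(W, b, a_prev, activation=None):
    """One dense layer: z = W.a_prev + b, then the activation (linear if None)."""
    z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
         for row, b_i in zip(W, b)]
    return activation(z) if activation else z  # linear activation: a = z

x = [1.0, 2.0]                                                  # made-up input
a1 = dense([[0.5, -0.2], [0.3, 0.8]], [0.1, -0.1], x, relu)     # hidden layer 1 (ReLU)
a2 = dense([[0.4, 0.6], [-0.7, 0.2]], [0.0, 0.2], a1, relu)     # hidden layer 2 (ReLU)
a3 = dense([[1.0, -1.0]], [0.5], a2)                            # output layer: a3 = z3 (linear)
print(a3)  # the raw logit, handed to the loss as-is when from_logits=True
```

Only the output layer skips its activation; the hidden layers still apply theirs.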

Raymond

Oh, so the hidden layers won't be affected? Can you help illustrate the process involved? Like W^1x+b^1=a^1 for the first layer, W^2a^1+b^2=a^2 for the second layer, and the output is directly W^3a^2+b^3 without a^3? I think an illustration of the process would help a lot, though it may be some trouble.

I can't agree with your formulas for the first 2 layers unless they use linear activations, do they? Otherwise, can you add the activation functions back?

For the output layer which is supposed to use linear activation, I have shown the formula in my previous reply. What is unclear about that formula?

Cheers,
Raymond

I have googled and found out that when from_logits = True, the SCCE loss will apply softmax to the final output, while if from_logits = False, we need to manually apply softmax as the activation, right? Can I know whether other models can use from_logits? I guess it depends on what I want to get? Because I want to get a multi-label result, I use SCCE.

Question 1: are you doing Binary classification or Multi-class?

If you do Binary, you need one neuron in the output layer and use BinaryCrossentropy.

If you do Multi-class, you need the same number of neurons as the number of classes in the output layer, and use SparseCategoricalCrossentropy or CategoricalCrossentropy.

Question 2: do you prefer to use from_logits = True?

When you set from_logits = True, you are telling Tensorflow that your NN outputs logits, which means you will use linear as the output layer's activation.

When you set from_logits = False, you are telling Tensorflow that your NN outputs probabilities, which means you will use sigmoid (for Binary) or softmax (for Multi-class) as the output layer's activation.
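The two settings compute the same loss value. Here is a plain-Python sketch of the multi-class case (this is the math, not TensorFlow's actual implementation, and the logits are made up):

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def scce_from_probs(probs, y):
    """from_logits=False: the model already output probabilities."""
    return -math.log(probs[y])

def scce_from_logits(z, y):
    """from_logits=True: work on raw logits directly.
    -log(softmax(z)[y]) simplifies to logsumexp(z) - z[y],
    so the probabilities are never formed."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z)) - z[y]

z = [2.0, -1.0, 0.5]   # made-up logits from a linear output layer
y = 0                  # true class index
print(scce_from_probs(softmax(z), y))
print(scce_from_logits(z, y))  # same value, without computing probabilities
```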

Cheers,
Raymond

I see. So if I use from_logits = True when using SCCE as the loss function, the final output will still be a softmax output, right?

When you set from_logits = True, you are telling Tensorflow that your NN outputs logits, which means you will use linear as the output layer's activation.

This means your NN doesn't have a softmax output.

Why donâ€™t you try it out yourself?
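One quick way to see it in plain Python (the logits below are made up): the raw outputs of a linear layer are not probabilities until you push them through a softmax yourself (in TensorFlow, tf.nn.softmax would do the same):

```python
import math

z = [4.1, -0.7, 1.3]   # made-up logits straight from a linear output layer
p = [math.exp(v) / sum(math.exp(u) for u in z) for v in z]  # softmax by hand

print(sum(z))  # the logits need not lie in [0, 1] or sum to 1
print(sum(p))  # the softmax-ed values do sum to 1
```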

Greetings!

Question 2: do you prefer to use from_logits = True?

When you set from_logits = True, you are telling Tensorflow that your NN outputs logits, which means you will use linear as the output layer's activation.

When you set from_logits = False, you are telling Tensorflow that your NN outputs probabilities, which means you will use sigmoid (for Binary) or softmax (for Multi-class) as the output layer's activation.

Since that video lesson was about Softmax, I also got confused when Mr. Ng changed the output layer's activation from Softmax to Linear.
Since the purpose of using from_logits = True was to optimize your code, that can only be done when your output layer's activation is linear, right? And for the cases where your output layer's activation is logistic or softmax, you just can't use from_logits anymore (which means that you will have to use the code Mr. Ng said not to use in the "Neural Network with Softmax output" video). Is my conclusion correct?

JC

Hello JC @Joan_Concha ,

Yes, we should only use one of the following four settings:

| problem | output layer activation | from_logits |
| --- | --- | --- |
| binary | linear | True |
| binary | sigmoid | False |
| multi-class | linear | True |
| multi-class | softmax | False |

Cheers,
Raymond

Thanks!

@rmwkwok I don't understand something. If by using from_logits = True we are merging the sigmoid activation function's mathematical operation 1/(1+e^-z) into the loss function equation, and thus avoiding an intermediate step that may introduce round-off errors, why doesn't the output layer output the probabilities, and why are we forced to do a tf.nn.sigmoid(logit)? My understanding was that the output layer activation was switched to linear to avoid applying the sigmoid again, but this had already been done once in the loss function.

I understand that the output layer activation is linear, but if the loss function already includes the sigmoid/softmax operation, why doesn't the output layer return probabilities?

@vmmf89, the program only goes through the loss function during training, not when making predictions.
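In other words, during training the sigmoid is folded into the loss, but at prediction time the model still hands you the raw logit, so you apply the sigmoid yourself only when you want a probability. A plain-Python sketch of the binary case (the logit value is made up; the stable loss form below is the standard one, not a copy of TensorFlow's source):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logits(z, y):
    """Training-time loss on the raw logit z, for label y in {0, 1}.
    Numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))];
    no sigmoid is ever evaluated on its own."""
    return max(z, 0.0) - z * y + math.log(1.0 + math.exp(-abs(z)))

z = 3.2                         # made-up logit from the linear output layer
loss = bce_from_logits(z, 1)    # training: works on z directly
p = sigmoid(z)                  # prediction: sigmoid applied only when you ask
print(loss, p)
```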

Cheers,
Raymond


Hey @rmwkwok ,
This thread is really helpful. I just need a bit of clarification. I tried looking into the source code but couldn't find where exactly this happens. When from_logits is true, p isn't calculated, so the loss function now has to choose which function to apply to transform these logits into meaningful probabilities. How is this function chosen, and does it vary for each loss function? Does Keras provide a param where we can define which transformation function to use when p isn't already calculated?

Hello @eix_rap,

First, tensorflow does not need to choose anything. Rather, we choose to use, for example, tf.keras.losses.BinaryCrossentropy, and once we decide that from_logits=True, tensorflow has no choice but to compute the binary cross entropy loss from the logits (as taught in the lecture).

Would you mind telling me what gives you the impression that there is a choice? Moreover, what would those possible choices be? I believe that if you share your view on these questions, it will help clarify things.

On the other hand, tf.keras.losses.BinaryCrossentropy does NOT compute the probabilities. The probabilities are NEVER computed. The lecture explained this by showing the from_logits=False version and the from_logits=True version:

In the slide, a (which is equivalent to your p) is the probability and only exists in the from_logits=False version (also called the original loss in the slide). It ONLY exists in the original loss. ONLY. Therefore, in the from_logits=True version, tensorflow never needs to compute the probabilities.

This is not about moving the computation of the probabilities from where we can see it to where we cannot see it. This is about NOT computing the probabilities at all.

The equation shown at the bottom of the slide might make you feel that we are still going to go through the process of computing the probabilities. We are NOT. That equation is only the first step of a series of simplifications that ultimately saves us from having to compute the probabilities. It can be transformed into another equation that never computes the probabilities.
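To sketch one way those simplifications can go (binary case; this is the algebra, not tensorflow's exact code path): for y = 1, the loss term is -\log a = -\log\frac{1}{1+e^{-z}} = \log(1+e^{-z}), and for y = 0 it is -\log(1-a) = \log(1+e^{z}). Both final forms are written purely in terms of the logit z, so a = \frac{1}{1+e^{-z}} never has to be computed as an intermediate value.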

There are a few (not complete) steps showing how tensorflow simplifies the bottom equation in the slide. For the full steps, I am afraid you will need to give it a try yourself, but please feel free to show me your attempt and I can take a look for you.

Cheers,
Raymond