Hi, I've got a question about is_logit. Prof. Ng said it makes z get sent directly to the loss computation without computing the intermediate a, but then why is the output layer's activation linear? I am confused by that.

Sorry, that should be from_logits.
By the way, I would like to know what a logit is, as well as the process involved and what the output is.
Thanks

A logit is the model's raw prediction. It comes from a linear activation in the final layer, which outputs an unbounded number, whereas, say, ReLU clips all negative values to zero. Which activation you want depends on your application, the desired predicted output, the cost function, etc.

In addition to @gent.spah's explanation, from another angle, we want to specify linear activation because the linear activation does nothing. We want a = z and linear gives us that.

Raymond


What I get is that instead of giving a as the output, it gives f(Z) directly, right? (Or Z? I think it is more likely to be Z as the output, since z is a linear combination of w, x and b, so the output becomes linear.) But what about the hidden layers instead of the output layer? Will they be Z also?

Hello @ricardowu1112, you did not define f in f(Z).

The output of the neural network (or the output of the output layer of the neural network) is \vec{a}^{[3]} = \vec{z}^{[3]} = W^{[3]}\vec{a}^{[2]} + \vec{b}^{[3]} where \vec{a}^{[2]} is the output of the 2nd hidden layer. Note that it is \vec{a}^{[2]}, but NOT x. Since we are using linear activation in the output layer, \vec{a}^{[3]} = \vec{z}^{[3]}.
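That forward pass can be sketched in plain Python. The weights, biases, and layer sizes below are made up for illustration, and the ReLU activations in the two hidden layers are an assumption; the point is only that the output layer applies no activation, so a = z there:

```python
def relu(z):
    return [max(0.0, v) for v in z]

def dense(W, b, a_prev, activation=None):
    """One dense layer: z = W.a_prev + b, then the activation (linear if None)."""
    z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
         for row, b_i in zip(W, b)]
    return activation(z) if activation else z  # linear activation: a = z

x = [1.0, 2.0]                                                  # made-up input
a1 = dense([[0.5, -0.2], [0.3, 0.8]], [0.1, -0.1], x, relu)     # hidden layer 1 (ReLU)
a2 = dense([[0.4, 0.6], [-0.7, 0.2]], [0.0, 0.2], a1, relu)     # hidden layer 2 (ReLU)
a3 = dense([[1.0, -1.0]], [0.5], a2)                            # output layer: a3 = z3 (linear)
print(a3)  # the raw logit, handed to the loss as-is when from_logits=True
```

Only the output layer skips its activation; the hidden layers still apply theirs.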

Raymond

Oh, so the hidden layers won't be affected? Can you help illustrate the process involved? Like W^1x+b^1=a^1 for the first layer, W^2a^1+b^2=a^2 for the second layer, and the output is directly W^3a^2+b^3 without a^3? I think an illustration of the process would help a lot, though it may be some trouble.

I can't agree with your formulas for the first 2 layers unless they use linear activations, do they? Otherwise, can you add the activation functions back?

For the output layer which is supposed to use linear activation, I have shown the formula in my previous reply. What is unclear about that formula?

Cheers,
Raymond

I have googled and found out that when from_logits = True, the SCCE loss will apply softmax to the final output, while if from_logits = False, we need to manually apply softmax as the activation, right? Can I know whether other models can use from_logits? I guess it depends on what I want to get? Because I want to get a multi-label result, I use SCCE.

Question 1: are you doing Binary classification or Multi-class?

If you do Binary, you need one neuron in the output layer and use BinaryCrossentropy.

If you do Multi-class, you need the same number of neurons as the number of classes in the output layer, and use SparseCategoricalCrossentropy or CategoricalCrossentropy.

Question 2: do you prefer to use from_logits = True?

When you set from_logits = True, you are telling Tensorflow that your NN outputs logits, which means you will use linear as the output layer's activation.

When you set from_logits = False, you are telling Tensorflow that your NN outputs probabilities, which means you will use sigmoid (for Binary) or softmax (for Multi-class) as the output layer's activation.
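The two settings compute the same loss value. Here is a plain-Python sketch of the multi-class case (this is the math, not TensorFlow's actual implementation, and the logits are made up):

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def scce_from_probs(probs, y):
    """from_logits=False: the model already output probabilities."""
    return -math.log(probs[y])

def scce_from_logits(z, y):
    """from_logits=True: work on raw logits directly.
    -log(softmax(z)[y]) simplifies to logsumexp(z) - z[y],
    so the probabilities are never formed."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z)) - z[y]

z = [2.0, -1.0, 0.5]   # made-up logits from a linear output layer
y = 0                  # true class index
print(scce_from_probs(softmax(z), y))
print(scce_from_logits(z, y))  # same value, without computing probabilities
```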

Cheers,
Raymond

I see. So if I use from_logits = True when using SCCE as the loss function, the final output will still be a softmax output, right?

When you set from_logits = True, you are telling Tensorflow that your NN outputs logits, which means you will use linear as the output layer's activation.

This means your NN doesn't have a softmax output.

Why donâ€™t you try it out yourself?
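One quick way to see it in plain Python (the logits below are made up): the raw outputs of a linear layer are not probabilities until you push them through a softmax yourself (in TensorFlow, tf.nn.softmax would do the same):

```python
import math

z = [4.1, -0.7, 1.3]   # made-up logits straight from a linear output layer
p = [math.exp(v) / sum(math.exp(u) for u in z) for v in z]  # softmax by hand

print(sum(z))  # the logits need not lie in [0, 1] or sum to 1
print(sum(p))  # the softmax-ed values do sum to 1
```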

Greetings!

Question 2: do you prefer to use from_logits = True?

When you set from_logits = True, you are telling Tensorflow that your NN outputs logits, which means you will use linear as the output layer's activation.

When you set from_logits = False, you are telling Tensorflow that your NN outputs probabilities, which means you will use sigmoid (for Binary) or softmax (for Multi-class) as the output layer's activation.

Since that video lesson was about Softmax, I also got confused when Mr. Ng changed the output layer's activation from Softmax to Linear.
Since the purpose of using from_logits = True was to optimize your code, that can only be done when your output layer's activation is linear, right? And for the cases where your output layer's activation is logistic or softmax, you just can't use from_logits anymore (which means that you will have to use the code Mr. Ng said not to use in the "Neural Network with Softmax output" video). Is my conclusion correct?

JC

Hello JC @Joan_Concha ,

Yes, we should only use one of the following four settings:

| problem | output layer activation | from_logits |
| --- | --- | --- |
| binary | linear | True |
| binary | sigmoid | False |
| multi-class | linear | True |
| multi-class | softmax | False |

Cheers,
Raymond

Thanks!

@rmwkwok I don't understand something. If by using from_logits = True we are merging the sigmoid activation function's mathematical operation 1/(1+e^-z) into the loss function equation, and thus avoiding an intermediate step that may introduce round-off errors, why doesn't the output layer output the probabilities, and why are we forced to do a tf.nn.sigmoid(logit)? My understanding was that the output layer activation was switched to linear to avoid applying the sigmoid again, but this had already been done once in the loss function.

I understand that the output layer activation is linear, but if the loss function already includes the sigmoid/softmax operation, why doesn't the output layer return probabilities?

@vmmf89, the program only goes through the loss function during training, not when making predictions.
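In other words, during training the sigmoid is folded into the loss, but at prediction time the model still hands you the raw logit, so you apply the sigmoid yourself only when you want a probability. A plain-Python sketch of the binary case (the logit value is made up; the stable loss form below is the standard one, not a copy of TensorFlow's source):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logits(z, y):
    """Training-time loss on the raw logit z, for label y in {0, 1}.
    Numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))];
    no sigmoid is ever evaluated on its own."""
    return max(z, 0.0) - z * y + math.log(1.0 + math.exp(-abs(z)))

z = 3.2                         # made-up logit from the linear output layer
loss = bce_from_logits(z, 1)    # training: works on z directly
p = sigmoid(z)                  # prediction: sigmoid applied only when you ask
print(loss, p)
```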

Cheers,
Raymond


Hey @rmwkwok ,
This thread is really helpful. I just need a bit of clarification. I tried looking into the source code but couldn't find where exactly this happens. When from_logits is true, p isn't calculated, so the loss function now has to choose which function to apply to transform these logits into meaningful probabilities. How is this function chosen, and does it vary for each loss function? Does Keras provide a param where we can define which transformation function to use when p isn't already calculated?

Hello @eix_rap,

First, tensorflow does not need to choose anything. Rather, we choose to use, for example, tf.keras.losses.BinaryCrossentropy, and once we decide that from_logits=True, tensorflow has no choice but to compute the binary cross entropy loss from the logits (as taught in the lecture).

Would you mind telling me what gives you the impression that there is a choice? Moreover, what would those possible choices be? I believe that if you share your view on these questions, it will help clarify things.

On the other hand, tf.keras.losses.BinaryCrossentropy does NOT compute the probabilities. The probabilities are NEVER computed. The lecture explained this by showing the from_logits=False version and the from_logits=True version:

In the slide, a (which is equivalent to your p) is the probability and only exists in the from_logits=False version (also called the original loss in the slide). It ONLY exists in the original loss. ONLY. Therefore, in the from_logits=True version, tensorflow never needs to compute the probabilities.

This is not about moving the computation of the probabilities from where we can see it to where we cannot see it. This is about NOT computing the probabilities at all.

The equation shown at the bottom of the slide might make you feel that we are still going to go through the process of computing the probabilities. We are NOT. That equation is only the first step of a series of simplifications that ultimately saves us from having to compute the probabilities. It can be transformed into another equation that never computes the probabilities.
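To sketch one way those simplifications can go (binary case; this is the algebra, not tensorflow's exact code path): for y = 1, the loss term is -\log a = -\log\frac{1}{1+e^{-z}} = \log(1+e^{-z}), and for y = 0 it is -\log(1-a) = \log(1+e^{z}). Both final forms are written purely in terms of the logit z, so a = \frac{1}{1+e^{-z}} never has to be computed as an intermediate value.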

There are a few (not complete) steps showing how tensorflow simplifies the bottom equation in the slide. For the full steps, I am afraid you will need to give it a try yourself, but please feel free to show me your attempt and I can take a look for you.

Cheers,
Raymond