Hi, I've got a question about is_logit. Prof. Ng said it makes z get sent directly to the loss computation without computing the intermediate a, but then why is the output layer's activation linear? I am confused by that.

Sorry, that should be from_logits.
By the way, I would like to know what a logit is, as well as the process involved and what the output is.
Thanks

A logit is the model's raw prediction. It comes from a linear activation in the final layer, which outputs an unbounded number, whereas, say, ReLU clips all negative values to zero. Which activation you want depends on your application, the desired predicted output, the cost function, etc.

In addition to @gent.spah's explanation, from another angle, we want to specify linear activation because the linear activation does nothing. We want a = z and linear gives us that.

Raymond


What I get is that instead of giving a as the output, it gives f(Z) directly, right? (Or Z? I think it is more likely to be Z as the output, since z is a linear combination of w, x and b, so the output becomes linear.) But what about the hidden layers instead of the output layer? Will they be Z also?

Hello @ricardowu1112, you did not define f in f(Z).

The output of the neural network (or the output of the output layer of the neural network) is \vec{a}^{[3]} = \vec{z}^{[3]} = W^{[3]}\vec{a}^{[2]} + \vec{b}^{[3]} where \vec{a}^{[2]} is the output of the 2nd hidden layer. Note that it is \vec{a}^{[2]}, but NOT x. Since we are using linear activation in the output layer, \vec{a}^{[3]} = \vec{z}^{[3]}.
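That forward pass can be sketched in plain Python. The weights, biases, and layer sizes below are made up for illustration, and the ReLU activations in the two hidden layers are an assumption; the point is only that the output layer applies no activation, so a = z there:

```python
def relu(z):
    return [max(0.0, v) for v in z]

def dense(W, b, a_prev, activation=None):
    """One dense layer: z = W.a_prev + b, then the activation (linear if None)."""
    z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
         for row, b_i in zip(W, b)]
    return activation(z) if activation else z  # linear activation: a = z

x = [1.0, 2.0]                                                  # made-up input
a1 = dense([[0.5, -0.2], [0.3, 0.8]], [0.1, -0.1], x, relu)     # hidden layer 1 (ReLU)
a2 = dense([[0.4, 0.6], [-0.7, 0.2]], [0.0, 0.2], a1, relu)     # hidden layer 2 (ReLU)
a3 = dense([[1.0, -1.0]], [0.5], a2)                            # output layer: a3 = z3 (linear)
print(a3)  # the raw logit, handed to the loss as-is when from_logits=True
```

Only the output layer skips its activation; the hidden layers still apply theirs.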

Raymond

Oh, so the hidden layers won't be affected? Can you help illustrate the process involved? Like W^1x+b^1=a^1 for the first layer, W^2a^1+b^2=a^2 for the second layer, and the output is directly W^3a^2+b^3 without a^3? I think an illustration of the process would help a lot, though it may be some trouble.

I can't agree with your formulas for the first 2 layers unless they use linear activations, do they? Otherwise, can you add the activation functions back?

For the output layer which is supposed to use linear activation, I have shown the formula in my previous reply. What is unclear about that formula?

Cheers,
Raymond

I have googled and found out that when from_logits = True, the SCCE loss will apply softmax to the final output, while if from_logits = False, we need to manually apply softmax as the activation, right? Can I know whether other models can use from_logits? I guess it depends on what I want to get? Because I want to get a multi-label result, I use SCCE.

Question 1: are you doing Binary classification or Multi-class?

If you do Binary, you need one neuron in the output layer and use BinaryCrossentropy.

If you do Multi-class, you need the same number of neurons as the number of classes in the output layer, and use SparseCategoricalCrossentropy or CategoricalCrossentropy.

Question 2: do you prefer to use from_logits = True?

When you set from_logits = True, you are telling Tensorflow that your NN outputs logits, which means you will use linear as the output layer's activation.

When you set from_logits = False, you are telling Tensorflow that your NN outputs probabilities, which means you will use sigmoid (for Binary) or softmax (for Multi-class) as the output layer's activation.
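The two settings compute the same loss value. Here is a plain-Python sketch of the multi-class case (this is the math, not TensorFlow's actual implementation, and the logits are made up):

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def scce_from_probs(probs, y):
    """from_logits=False: the model already output probabilities."""
    return -math.log(probs[y])

def scce_from_logits(z, y):
    """from_logits=True: work on raw logits directly.
    -log(softmax(z)[y]) simplifies to logsumexp(z) - z[y],
    so the probabilities are never formed."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z)) - z[y]

z = [2.0, -1.0, 0.5]   # made-up logits from a linear output layer
y = 0                  # true class index
print(scce_from_probs(softmax(z), y))
print(scce_from_logits(z, y))  # same value, without computing probabilities
```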

Cheers,
Raymond

I see. So if I use from_logits = True when using SCCE as the loss function, the final output will still be a softmax output, right?

When you set from_logits = True, you are telling Tensorflow that your NN outputs logits, which means you will use linear as the output layer's activation.

This means your NN doesn't have a softmax output.

Why donâ€™t you try it out yourself?
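One quick way to see it in plain Python (the logits below are made up): the raw outputs of a linear layer are not probabilities until you push them through a softmax yourself (in TensorFlow, tf.nn.softmax would do the same):

```python
import math

z = [4.1, -0.7, 1.3]   # made-up logits straight from a linear output layer
p = [math.exp(v) / sum(math.exp(u) for u in z) for v in z]  # softmax by hand

print(sum(z))  # the logits need not lie in [0, 1] or sum to 1
print(sum(p))  # the softmax-ed values do sum to 1
```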

Greetings!

Question 2: do you prefer to use from_logits = True?

When you set from_logits = True, you are telling Tensorflow that your NN outputs logits, which means you will use linear as the output layer's activation.

When you set from_logits = False, you are telling Tensorflow that your NN outputs probabilities, which means you will use sigmoid (for Binary) or softmax (for Multi-class) as the output layer's activation.

Since that video lesson was about Softmax, I also got confused when Mr. Ng changed the output layer's activation from Softmax to Linear.
Since the purpose of using from_logits = True was to optimize your code, that can only be done when your output layer's activation is linear, right? And for the cases where your output layer's activation is logistic or softmax, you just can't use from_logits anymore (which means that you will have to use the code Mr. Ng said not to use in the "Neural Network with Softmax output" video). Is my conclusion correct?

JC

Hello JC @Joan_Concha ,

Yes, we should only use one of the following four settings:

| problem | output layer activation | from_logits |
| --- | --- | --- |
| binary | linear | True |
| binary | sigmoid | False |
| multi-class | linear | True |
| multi-class | softmax | False |

Cheers,
Raymond

Thanks!

@rmwkwok I don't understand something. If by using from_logits = True we are merging the sigmoid activation function's mathematical operation 1/(1+e^-z) into the loss function equation, and thus avoiding an intermediate step that may introduce round-off errors, why doesn't the output layer output the probabilities, and why are we forced to do a tf.nn.sigmoid(logit)? My understanding was that the output layer activation was switched to linear to avoid applying the sigmoid again, but this had already been done once in the loss function.

I understand that the output layer activation is linear, but if the loss function already includes the sigmoid/softmax operation, why doesn't the output layer return probabilities?

@vmmf89, the program only goes through the loss function during training, not when making predictions.
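In other words, during training the sigmoid is folded into the loss, but at prediction time the model still hands you the raw logit, so you apply the sigmoid yourself only when you want a probability. A plain-Python sketch of the binary case (the logit value is made up; the stable loss form below is the standard one, not a copy of TensorFlow's source):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_logits(z, y):
    """Training-time loss on the raw logit z, for label y in {0, 1}.
    Numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))];
    no sigmoid is ever evaluated on its own."""
    return max(z, 0.0) - z * y + math.log(1.0 + math.exp(-abs(z)))

z = 3.2                         # made-up logit from the linear output layer
loss = bce_from_logits(z, 1)    # training: works on z directly
p = sigmoid(z)                  # prediction: sigmoid applied only when you ask
print(loss, p)
```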

Cheers,
Raymond


Hey @rmwkwok ,
This thread is really helpful. I just need a bit of clarification. I tried looking into the source code but couldn't find where exactly this happens. When from_logits is true, p isn't calculated, so the loss function now has to choose which function to apply to transform these logits into meaningful probabilities. How is this function chosen, and does it vary for each loss function? Does Keras provide a param where we can define which transformation function to use when p isn't already calculated?

Hello @eix_rap,

First, tensorflow does not need to choose anything. Rather, we choose to use, for example, tf.keras.losses.BinaryCrossentropy, and once we decide that from_logits=True, tensorflow has no choice but to compute the binary cross entropy loss from the logits (as taught in the lecture).

Would you mind telling me what gives you the impression that there is a choice? Moreover, what would those possible choices be? I believe that if you share your view on these questions, it will help clarify things.

On the other hand, tf.keras.losses.BinaryCrossentropy does NOT compute the probabilities. The probabilities are NEVER computed. The lecture explained this by showing the from_logits=False version and the from_logits=True version:

In the slide, a (which is equivalent to your p) is the probability and only exists in the from_logits=False version (also called the original loss in the slide). It ONLY exists in the original loss. ONLY. Therefore, in the from_logits=True version, tensorflow never needs to compute the probabilities.

This is not about moving the computation of the probabilities from where we can see it to where we cannot see it. This is about NOT computing the probabilities at all.

The equation shown at the bottom of the slide might make you feel that we are still going to go through the process of computing the probabilities. We are NOT. That equation is only the first step of a series of simplifications that ultimately saves us from having to compute the probabilities. It can be transformed into another equation that never computes the probabilities.
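To sketch one way those simplifications can go (binary case; this is the algebra, not tensorflow's exact code path): for y = 1, the loss term is -\log a = -\log\frac{1}{1+e^{-z}} = \log(1+e^{-z}), and for y = 0 it is -\log(1-a) = \log(1+e^{z}). Both final forms are written purely in terms of the logit z, so a = \frac{1}{1+e^{-z}} never has to be computed as an intermediate value.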

There are a few (not complete) steps showing how tensorflow simplifies the bottom equation in the slide. For the full steps, I am afraid you will need to give it a try yourself, but please feel free to show me your attempt and I can take a look for you.

Cheers,
Raymond