Question about from_logits

Hey @rmwkwok

I was a bit unclear there earlier. You’re right: we’re trying to eliminate the round-off error by not computing the probability explicitly, and that is a change to the loss function formula itself.

This idea crossed my mind because I was under the impression that, when from_logits is true, the loss is computed by substituting the activation’s formula in for the input. What I failed to consider was that this substitution is just part of the explanation, the start of the transformation of the formula. Under that assumption, I believed any activation function’s formula could be substituted in as such a transformation.
Thanks,
Eakanath

Hello Eakanath @eix_rap,

When you said “any activation function’s formula”, I felt a little bit “unsafe” :wink: .

Please come back later, and I will show you the steps for transforming “that bottom equation in the slide I shared” into “an equation that does not compute the probabilities explicitly”. Then you can judge again.

Raymond

Hello Eakanath @eix_rap,

Here are the steps.

I want you to look at how we have avoided computing e^{-z} when z is so large and negative that the result overflows. We have e^{-z} because we use sigmoid for binary classification, and we have e^{z} because we use softmax for multi-class classification. Note the importance of the choice of activation; that is why I felt unsafe when you said “any activation”.

Of course, I might be over-reacting because you might have been thinking only about sigmoid and softmax all along, but please don’t mind and let me make it a bit clearer :wink:
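
To make that concrete, here is a minimal NumPy sketch (with made-up, extreme logit values) showing which sign of z is the dangerous one for each activation:

```python
import numpy as np

# made-up extreme logits, just to show which sign causes trouble
z_neg = np.float64(-1000.0)
z_pos = np.float64(1000.0)

# sigmoid uses e^{-z}, so a very NEGATIVE z is the dangerous case
print(np.exp(-z_neg))   # e^{1000}  -> inf (overflow)
print(np.exp(-z_pos))   # e^{-1000} -> 0.0 (harmless underflow)

# softmax uses e^{z}, so a very POSITIVE z is the dangerous case
print(np.exp(z_pos))    # e^{1000}  -> inf (overflow)
```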

Cheers,
Raymond

Hey Raymond,

Logarithmic operations, right? In the loss function, -y * log(1/(1 + np.exp(-z))) becomes y * log(1 + np.exp(-z)). Now if the value of z is too low (negative) or too high (positive), we won’t get skewed values.
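
(As a quick sanity check of that rewrite, here is a sketch with arbitrary z values; both forms give the same numbers:)

```python
import numpy as np

z = np.array([-5.0, 0.0, 5.0])   # arbitrary sample logits
y = 1.0                          # looking only at the y = 1 term of the loss

original  = -y * np.log(1.0 / (1.0 + np.exp(-z)))   # -y * log(sigmoid(z))
rewritten =  y * np.log1p(np.exp(-z))               # y * log(1 + e^{-z})

print(np.allclose(original, rewritten))             # True
```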

I think you have got the idea. For example, we want to avoid e^{10000}, but it would be fine to have e^{-10000}.

Wouldn’t that mean a near-zero loss? Again, I’m speaking in the context of the function I mentioned above, y * log(1 + np.exp(-z)).

Being close to zero isn’t a problem. It doesn’t overflow.

In plain Python, we don’t really feel the overflow problem because Python handles it for us. However, when we use TensorFlow, numbers are not stored in Python variables but in variables of a fixed memory size. Take a 16-bit floating-point number as an example: on my system it can only hold numbers in the following small range:

[image: the representable range of a 16-bit float, roughly -6.55e4 to 6.55e4]

Therefore, e^{20} (not to mention e^{10000}) will overflow such a variable, and that is the problem we want to get rid of.

You questioned whether e^{-10000} becomes zero. Yes, it will be stored as zero, but that is just a matter of insufficient precision, and the true value is extremely close to zero anyway.
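
Here is a minimal NumPy sketch of both effects (the exact printed limits may vary slightly with your NumPy version):

```python
import numpy as np

# the representable range of a 16-bit float is tiny (max is about 6.55e4)
print(np.finfo(np.float16))

print(np.exp(np.float16(20)))    # e^{20} ≈ 4.9e8 > 6.55e4 -> inf (overflow)
print(np.exp(np.float16(-20)))   # e^{-20} ≈ 2e-9, too small to represent -> 0.0
```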


This is a perfect explanation. Understanding this overflow problem will help me with similar issues in the future too. Thanks a lot!

Don’t forget about the steps. You may have got the idea of overflow, but the steps are about the tricks for actually handling it, especially the part where we separate the equation into two cases. You might find them handy in the future.
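
For reference, the two cases are usually folded into a single stable expression. Here is a sketch (the helper name stable_bce_from_logits is just for illustration) of the form that, as far as I know, matches the formula documented for tf.nn.sigmoid_cross_entropy_with_logits:

```python
import numpy as np

def stable_bce_from_logits(z, y):
    """Binary cross-entropy computed directly from the logit z.

    Mathematically equal to -y*log(sigmoid(z)) - (1-y)*log(1 - sigmoid(z)),
    but rearranged so np.exp only ever sees a non-positive argument:
        max(z, 0) - z*y + log(1 + e^{-|z|})
    """
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

print(stable_bce_from_logits(1000.0, 1.0))    # ~0.0, no overflow
print(stable_bce_from_logits(-1000.0, 1.0))   # ~1000.0, still no overflow
```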


Raymond,

Thank you for all the information in this thread. I understand why we want to avoid using an intermediate value for ‘a’ (just like we do not round our answer until the final step in algebra). However, I am struggling to understand why it is “preferable” in this case to use linear instead of sigmoid for the activation in a binary classification problem. I remember, from previous lectures, that sigmoid is almost always the best choice for activating the final layer of a binary classification NN. Could you please clarify?

Also, to make sure I am understanding this implementation correctly: do we get the same result using “from_logits = True” as we would by manually changing the activation of the output layer from “softmax” to “linear” (without adding the “from_logits = True” argument to the loss function)?

Best,
AK

Hello @Khalid_A.W,

We don’t just prefer to use linear instead of sigmoid for the activation. We prefer to use linear for the activation AND to set from_logits to True in the loss function that is passed to model training. If you follow these steps, you will see that sigmoid is never out of the game.

Again, sigmoid is there if we set from_logits to True. It is NOT in the output layer, because the layer’s activation is linear; however, it IS in the loss function if we set from_logits to True.

Do the experiment yourself :wink:
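
A possible starting point for that experiment, just as a sketch with made-up logits and labels:

```python
import tensorflow as tf

z = tf.constant([[-3.0], [0.5], [4.0]])   # raw outputs of a linear layer (logits)
y = tf.constant([[0.0], [1.0], [1.0]])    # made-up labels

# linear output layer + from_logits=True: sigmoid lives inside the loss
loss_from_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)
print(loss_from_logits(y, z).numpy())

# sigmoid output layer + from_logits=False: sigmoid computed before the loss
loss_from_probs = tf.keras.losses.BinaryCrossentropy(from_logits=False)
print(loss_from_probs(y, tf.sigmoid(z)).numpy())   # same value, less stable path
```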

Cheers,
Raymond
