Question about from_logits

Hey @rmwkwok

I was a bit unclear there earlier. You’re right: we’re trying to eliminate the round-off error by not computing the probability explicitly, and that is a change to the loss function formula itself.

This idea crossed my mind because I was under the impression that, when from_logits is true, the loss is computed by substituting the activation’s formula in for the input. What I failed to consider was that this substitution is just part of the explanation, the start of the transformation of the formula. Under that assumption, I believed any activation function’s formula could be substituted in as such a transformation.
Thanks,
Eakanath

Hello Eakanath @eix_rap,

When you said “any activation function’s formula”, I felt a little bit “unsafe” :wink: .

Please come back later, and I will show you the steps for transforming “that bottom equation in the slide I shared” into “an equation that does not compute the probabilities explicitly”. Then you can judge again.

Raymond

Hello Eakanath @eix_rap,

Here are the steps.

I want you to look at how we have avoided computing e^{-z} when z is so large and negative that the result overflows. We have e^{-z} because we use sigmoid for binary classification, and we have e^{z} because we use softmax for multi-class classification. Note the importance of the choice of activation; that is why I felt unsafe when you said “any activation”.

Of course, I might be over-reacting because you might have been thinking only about sigmoid and softmax all along, but please don’t mind and let me make it a bit clearer :wink:
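
To make that concrete, here is a minimal NumPy sketch (with made-up, extreme logit values) showing which sign of z is the dangerous one for each activation:

```python
import numpy as np

# made-up extreme logits, just to show which sign causes trouble
z_neg = np.float64(-1000.0)
z_pos = np.float64(1000.0)

# sigmoid uses e^{-z}, so a very NEGATIVE z is the dangerous case
print(np.exp(-z_neg))   # e^{1000}  -> inf (overflow)
print(np.exp(-z_pos))   # e^{-1000} -> 0.0 (harmless underflow)

# softmax uses e^{z}, so a very POSITIVE z is the dangerous case
print(np.exp(z_pos))    # e^{1000}  -> inf (overflow)
```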

Cheers,
Raymond

Hey Raymond,

Logarithmic operations, right? In the loss function, -y * log(1/(1 + np.exp(-z))) becomes y * log(1 + np.exp(-z)). Now if the value of z is too low (negative) or too high (positive), we won’t get skewed values.
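
(As a quick sanity check of that rewrite, here is a sketch with arbitrary z values; both forms give the same numbers:)

```python
import numpy as np

z = np.array([-5.0, 0.0, 5.0])   # arbitrary sample logits
y = 1.0                          # looking only at the y = 1 term of the loss

original  = -y * np.log(1.0 / (1.0 + np.exp(-z)))   # -y * log(sigmoid(z))
rewritten =  y * np.log1p(np.exp(-z))               # y * log(1 + e^{-z})

print(np.allclose(original, rewritten))             # True
```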

I think you have got the idea. For example, we want to avoid e^{10000}, but it would be fine to have e^{-10000}.

Wouldn’t that mean a near-zero loss? Again, I’m speaking in the context of the function I mentioned above, y * log(1 + np.exp(-z)).

Being close to zero isn’t a problem. It doesn’t overflow.

In plain Python, we don’t really feel the overflow problem because Python handles it for us. However, when we use TensorFlow, numbers are not stored in Python variables but in variables of a fixed memory size. Take a 16-bit floating-point number as an example: on my system it can only hold numbers in the following small range:

[image: the representable range of a 16-bit float, roughly -6.55e4 to 6.55e4]

Therefore, e^{20} (not to mention e^{10000}) will overflow such a variable, and that is the problem we want to get rid of.

You questioned whether e^{-10000} becomes zero. Yes, it will be stored as zero, but that is just a matter of insufficient precision, and the true value is extremely close to zero anyway.
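
Here is a minimal NumPy sketch of both effects (the exact printed limits may vary slightly with your NumPy version):

```python
import numpy as np

# the representable range of a 16-bit float is tiny (max is about 6.55e4)
print(np.finfo(np.float16))

print(np.exp(np.float16(20)))    # e^{20} ≈ 4.9e8 > 6.55e4 -> inf (overflow)
print(np.exp(np.float16(-20)))   # e^{-20} ≈ 2e-9, too small to represent -> 0.0
```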


This is a perfect explanation. Understanding this overflow problem will help me with similar issues in the future too. Thanks a lot!

Don’t forget about the steps. You may have got the idea of overflow, but the steps are about the tricks for actually handling it, especially the part where we separate the equation into two cases. You might find them handy in the future.
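
For reference, the two cases are usually folded into a single stable expression. Here is a sketch (the helper name stable_bce_from_logits is just for illustration) of the form that, as far as I know, matches the formula documented for tf.nn.sigmoid_cross_entropy_with_logits:

```python
import numpy as np

def stable_bce_from_logits(z, y):
    """Binary cross-entropy computed directly from the logit z.

    Mathematically equal to -y*log(sigmoid(z)) - (1-y)*log(1 - sigmoid(z)),
    but rearranged so np.exp only ever sees a non-positive argument:
        max(z, 0) - z*y + log(1 + e^{-|z|})
    """
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

print(stable_bce_from_logits(1000.0, 1.0))    # ~0.0, no overflow
print(stable_bce_from_logits(-1000.0, 1.0))   # ~1000.0, still no overflow
```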


Raymond,

Thank you for all the information in this thread. I understand why we want to avoid using an intermediate value for ‘a’ (just like we do not round our answer until the final step in algebra). However, I am struggling to understand why it is “preferable” in this case to use linear instead of sigmoid for the activation in a binary classification problem. I remember, from previous lectures, that sigmoid is almost always the best choice for activating the final layer of a binary classification NN. Could you please clarify?

Also, to make sure I am understanding this implementation correctly: do we get the same result using “from_logits = True” as we would by manually changing the activation of the output layer from “softmax” to “linear” (without adding the “from_logits = True” argument to the loss function)?

Best,
AK

Hello @Khalid_A.W,

We don’t just prefer to use linear instead of sigmoid for the activation. We prefer to use linear for the activation AND to set from_logits to True in the loss function that is passed to model training. If you follow these steps, you will see that sigmoid is never out of the game.

Again, sigmoid is there if we set from_logits to True. It is NOT in the output layer, because the layer’s activation is linear; however, it IS in the loss function if we set from_logits to True.

Do the experiment yourself :wink:
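
A possible starting point for that experiment, just as a sketch with made-up logits and labels:

```python
import tensorflow as tf

z = tf.constant([[-3.0], [0.5], [4.0]])   # raw outputs of a linear layer (logits)
y = tf.constant([[0.0], [1.0], [1.0]])    # made-up labels

# linear output layer + from_logits=True: sigmoid lives inside the loss
loss_from_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)
print(loss_from_logits(y, z).numpy())

# sigmoid output layer + from_logits=False: sigmoid computed before the loss
loss_from_probs = tf.keras.losses.BinaryCrossentropy(from_logits=False)
print(loss_from_probs(y, tf.sigmoid(z)).numpy())   # same value, less stable path
```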

Cheers,
Raymond
