What exactly does the “improved implementation of softmax” video mean?

I don’t understand. Why do we change the activation of the output layer to ‘linear’, and add from_logits to the loss function: BinaryCrossentropy(from_logits=True)?

Help meee
Thanks so much

Hello @Nhat_Minh, welcome to our community! :tada:

The motivation behind the improvement is that TensorFlow works more accurately with the 3rd equation on the left than with the 2nd equation on the left. Notice that in the 3rd equation, the term a never shows up, which means that by adopting the 3rd equation, TensorFlow does not need to calculate a. This is good because explicitly calculating a can introduce numerical inaccuracy, which is not favourable.

To make TensorFlow work without explicitly calculating a, we need to change the activation in the output layer from sigmoid to linear, because having sigmoid there is what makes TensorFlow compute a. Using linear actually means that we are not applying any activation function. Therefore, changing from sigmoid to linear means we go from passing a = g(z) into the loss to passing z into the loss.

Moreover, we need to let TensorFlow know that we are passing z into the loss instead of a, because TensorFlow cannot detect the change by itself. We notify TensorFlow by adding from_logits=True there. (“Logit” is the name for z.)

Now, with these two code changes, we enjoy a more numerically accurate model training process.



Thanks so much, @rmwkwok. I understand much better now. But I want to know more; help me if you have time.

You say we pass z into the loss instead of a. But actually, don’t we pass 1/(1 + e^-z)? I have been trying to google the binary_crossentropy() function in detail to understand more, but even after putting in much time and effort I haven’t found it.

Hello @Nhat_Minh,

Without from_logits, the function takes the form of the second equation, which accepts an a. With from_logits=True, the function takes the form of the third equation, which accepts only a z. Depending on whether from_logits is True or False, TensorFlow uses different functions.



I have a follow-up to this. What I’m missing is that I’m still unsure how we are skipping a calculation. Don’t we still have to compute the logistic or softmax values at some point?


Hello @Maxim_Kupfer, thank you for the question!

The log loss value will be calculated, but not the sigmoid nor the softmax value. I am going to show you that; please feel free to check my calculation if you would like to. I will use the case of a binary outcome for simplicity, but the core idea is identical.

p = \frac{1}{1+\exp{(-z)}}
Loss = -y\log{p} - (1-y)\log{(1-p)}

Here z is the logit value. Without calculating the sigmoid to get the corresponding value of p, we have the freedom to substitute p into Loss and simplify in a way that improves numerical stability:

Loss = -y\log{(\frac{1}{1+\exp{(-z)}})} - (1-y)\log{(1-\frac{1}{1+\exp{(-z)}})}
= -y\log{(\frac{1}{1+\exp{(-z)}})} - (1-y)\log{(\frac{\exp{(-z)}}{1+\exp{(-z)}})}
= \log(1+\exp{(-z)})- zy +z

If you calculate Loss in this manner, you will never have calculated p, agree? Now there is one more trick to improve stability, which is to treat the cases z < 0 and z \ge 0 separately, because when z < 0 the \exp{(-z)} term can yield a very large value that overflows the floating-point representation. This is what we do when z < 0:

\log(1+\exp{(-z)})- zy +z = \log{(1+\exp{(z)})} -zy

which will never yield any exponentially large numbers, because \exp{(z)} \approx 0 when z \ll 0.

In summary, now we have, from our original Loss function,

Loss^{\ge 0} = \log(1+\exp{(-z)}) -zy + z
Loss^{< 0} =\log{(1+\exp{(z)})} -zy

that will never produce any exponentially large number.
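A direct translation of the two branches into plain Python (no TensorFlow, just to demonstrate the stability; in 64-bit floats math.exp overflows near an argument of 710, so z = -800 is used as the extreme case):

```python
import math

def naive_bce(z, y):
    # textbook route: compute p = sigmoid(z) first, then the log loss
    p = 1.0 / (1.0 + math.exp(-z))   # exp(-z) overflows when z is very negative
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

def stable_bce(z, y):
    # branch on the sign of z so exp() only ever sees a non-positive
    # argument and can never overflow
    if z >= 0:
        return math.log(1.0 + math.exp(-z)) - z * y + z
    return math.log(1.0 + math.exp(z)) - z * y

# both agree where the naive version is safe
assert abs(naive_bce(2.5, 1.0) - stable_bce(2.5, 1.0)) < 1e-9

# the naive version overflows at an extreme negative logit ...
try:
    naive_bce(-800.0, 0.0)
except OverflowError:
    print("naive version overflowed")

# ... while the stable version is fine
print(stable_bce(-800.0, 0.0))  # → 0.0
```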



Thanks for the reply @rmwkwok.

If you calculate Loss in this manner, you will never have calculated p out, agree?

That’s where I’m not understanding, since aren’t you still doing a calculation in the end, with the only difference being that the terms were rearranged? Is it just that this new calculation is simpler?


My algebra is a bit rough. Could you elaborate on how you simplified in the above equation?

No problem!

The key difference is whether we explicitly calculated p or not.
Let’s say y=0 and z=-305.8. If we do

p = \frac{1}{1+\exp{(-z)}}
Loss = -y\log{p} - (1-y)\log{(1-p)}

we will have to first calculate p, and for this purpose we have to evaluate \exp{(+305.8)} \approx 6.4\times10^{132}, which already overflows the float32 type that TensorFlow uses by default.

However, if we do

Loss^{\ge 0} = \log(1+\exp{(-z)}) -zy + z
Loss^{< 0} =\log{(1+\exp{(z)})} -zy

we will not have to calculate p nor evaluate \exp{(+305.8)}, only \exp{(-305.8)} \approx 0, which cannot overflow.

\log(1+\exp{(-z)}) = \log(\exp{(-z)}\times(1+\exp{(z)})) = -z + \log{(1+\exp{(z)})}
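If it helps, this identity can be spot-checked numerically with a few lines of plain Python:

```python
import math

# check log(1 + exp(-z)) == -z + log(1 + exp(z)) for a few sample z
for z in (-4.0, -1.0, 0.5, 3.0):
    lhs = math.log(1.0 + math.exp(-z))
    rhs = -z + math.log(1.0 + math.exp(z))
    assert abs(lhs - rhs) < 1e-12
```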



I have one query: shouldn’t using logits be the default behaviour of the softmax output in TensorFlow? That is, if using softmax leads to numerical inaccuracies, then in which scenario is it preferred over the logits way?

I cannot think of a case where logits is not preferred.

TensorFlow is a flexible framework that allows you to go the logits way or not. There is no default for that particular setting.

However, logits is my default.