What exactly does the improved implementation of softmax video mean?

Nhat_Minh · July 27, 2022, 4:49am

I don’t understand why we change the activation of the output layer to ‘linear’ and add from_logits to the loss function: BinaryCrossentropy(from_logits=True)

Help meee
Thanks so much

rmwkwok · July 27, 2022, 8:14am

Hello @Nhat_Minh, welcome to our community!

The motivation behind the improvement is that TensorFlow works more accurately with the 3rd equation on the left than with the 2nd equation on the left. We can see that in the 3rd equation the term a never shows up, which means that by adopting the 3rd equation, TensorFlow does not need to calculate a as a separate intermediate value. This is good because calculating a explicitly can introduce numerical inaccuracy, which is not favourable.

To make TensorFlow work without calculating a, we need to change the activation in the output layer from sigmoid to linear, because having sigmoid there is the reason TensorFlow computes a. Using linear actually means that no activation function is applied. Therefore, changing from sigmoid to linear means that we are changing from passing a = g(z) into the loss to passing z into the loss.

Moreover, we need to let TensorFlow know that we are passing z into the loss instead of a, because TensorFlow cannot detect the change itself. We notify TensorFlow by adding from_logits=True there. Also, logit is the name for z.

Now, with these two code changes, we enjoy a more accurate process of model training.
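As a rough illustration of those two changes, here is a minimal sketch assuming the tf.keras API. The hidden layer sizes and the Adam optimizer are placeholders chosen for the example, not something taken from the lecture.

```python
import tensorflow as tf

# Change 1: the output layer uses a linear activation, so the model emits z.
# (The hidden layer sizes below are arbitrary placeholders.)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(15, activation="relu"),
    tf.keras.layers.Dense(1, activation="linear"),
])

# Change 2: tell the loss that it receives logits (z), not probabilities (a).
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True)
model.compile(optimizer="adam", loss=loss_fn)
```

Because the model now outputs z, a probability has to be obtained separately when it is needed, for example with tf.nn.sigmoid(model(X)).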

Cheers,
Raymond

Nhat_Minh · July 28, 2022, 7:38am

Thanks so much, @rmwkwok. I understand much more. But I want to know more. Help me if you have time.

You say we are passing z into the loss instead of a. But actually, we pass 1/(1 + e^-z). I have been googling the binary_crossentropy() function to understand it in more detail, but despite putting in a lot of time and effort I haven’t found an explanation.

rmwkwok · July 28, 2022, 7:44am

Hello @Nhat_Minh,

Without from_logits=True, the function takes the form of the second equation, which accepts an a. With from_logits=True, the function takes the form of the third equation, which accepts only a z. Depending on whether from_logits is True or False, TensorFlow uses different functions.
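To make the "different functions" idea concrete, here is a small pure-Python sketch. The function binary_crossentropy below is a hypothetical stand-in written for this example, not TensorFlow's source; its logits branch uses the form obtained by substituting the sigmoid into the log loss and simplifying.

```python
import math

def binary_crossentropy(y, x, from_logits=False):
    """Hypothetical stand-in for the two forms; not TensorFlow's source."""
    if from_logits:
        z = x  # x is the raw output of a linear layer
        return math.log(1 + math.exp(-z)) - z * y + z
    a = x  # x is already a sigmoid output in (0, 1)
    return -y * math.log(a) - (1 - y) * math.log(1 - a)

z = 2.0
a = 1 / (1 + math.exp(-z))  # what a sigmoid output layer would emit
loss_from_a = binary_crossentropy(1.0, a)
loss_from_z = binary_crossentropy(1.0, z, from_logits=True)
```

For a moderate z like this one, the two values agree up to rounding. The difference only matters for extreme z.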

Cheers,
Raymond

Maxim_Kupfer · August 23, 2022, 11:34pm

I have a follow-up to this. What I’m missing is how we are skipping a calculation. Don’t we still have to compute the logistic or softmax values at some point?

rmwkwok · August 24, 2022, 12:18am

Hello @Maxim_Kupfer, thank you for the question!

The log loss value will be calculated, but neither the sigmoid nor the softmax value will be. I am going to show you that, and please feel free to check my calculation if you would like to. I will use the case of a binary outcome for simplicity, but the core idea is identical.

Given
p = \frac{1}{1+\exp{(-z)}}
Loss = -y\log{p} - (1-y)\log{(1-p)}

Here z is the logit value. Without calculating the sigmoid to get the corresponding value of p, we are free to substitute the expression for p into Loss and simplify it in a way that improves numerical stability:

Loss = -y\log{(\frac{1}{1+\exp{(-z)}})} - (1-y)\log{(1-\frac{1}{1+\exp{(-z)}})}
= -y\log{(\frac{1}{1+\exp{(-z)}})} - (1-y)\log{(\frac{\exp{(-z)}}{1+\exp{(-z)}})}
= \log(1+\exp{(-z)})- zy +z
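
For anyone who wants the last step spelled out, here is a sketch of the intermediate algebra, using only the symbols defined above:

\log{(\frac{1}{1+\exp{(-z)}})} = -\log(1+\exp{(-z)})
\log{(\frac{\exp{(-z)}}{1+\exp{(-z)}})} = -z - \log(1+\exp{(-z)})
Loss = y\log(1+\exp{(-z)}) + (1-y)(z + \log(1+\exp{(-z)}))
= \log(1+\exp{(-z)}) - zy + z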

If you calculate Loss in this manner, you will never have calculated p out, agree? Now there is one more trick to improve stability, which is to consider the cases z < 0 and z \ge 0 separately, because in the former case the \exp{(-z)} term can yield a very large value that overflows the floating-point data type. This is what we are going to do when z < 0:

\log(1+\exp{(-z)}) - zy + z = \log(1+\exp{(z)}) - zy

which will never yield any exponentially large numbers because \exp{(z)} \approx 0 when z \ll 0.

In summary, now we have, from our original Loss function,

Loss^{\ge 0} = \log(1+\exp{(-z)}) - zy + z
Loss^{< 0} = \log(1+\exp{(z)}) - zy

that will never produce any exponentially large number.
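
The two cases can be sketched in plain Python as follows. This is only an illustration of the formulas above, with function names made up for the example; it is not a copy of what TensorFlow does internally.

```python
import math

def stable_log_loss(y, z):
    """Binary log loss computed from the logit z, split by the sign of z."""
    if z >= 0:
        # exp(-z) <= 1 here, so it cannot overflow
        return math.log(1 + math.exp(-z)) - z * y + z
    # z < 0: exp(z) < 1 here, so it cannot overflow
    return math.log(1 + math.exp(z)) - z * y

def naive_log_loss(y, z):
    """Same loss, but computing p explicitly first."""
    p = 1 / (1 + math.exp(-z))  # exp(-z) overflows for very negative z
    return -y * math.log(p) - (1 - y) * math.log(1 - p)
```

For moderate z the two functions agree up to rounding. For z = -1000, naive_log_loss raises an OverflowError when it evaluates math.exp(1000), while stable_log_loss still returns a finite value.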

Raymond

Maxim_Kupfer · August 24, 2022, 5:38pm

Thanks for the reply @rmwkwok.

If you calculate Loss in this manner, you will never have calculated p out, agree?

That’s what I’m not understanding: aren’t you still doing a calculation in the end, with the only difference being that the terms were rearranged? Is it just that this new calculation is simplified?

log(1+exp(−z))−zy+z=log(1+exp(z))−zy

My algebra is a bit rough. Could you elaborate on how you simplified in the above equation?

rmwkwok · August 24, 2022, 10:49pm

No problem!

The key difference is whether we explicitly calculated p or not.
Let’s say y=0 and z=-305.8. If we do

p = \frac{1}{1+\exp{(-z)}}
Loss = -y\log{p} - (1-y)\log{(1-p)}

we will have to first calculate p, and for this purpose we have to evaluate \exp{(+305.8)} = 6.415825835×10^{132}, which is far beyond what a 32-bit float can hold (its maximum is about 3.4×10^{38}).

However, if we do

Loss^{\ge 0} = \log(1+\exp{(-z)}) - zy + z
Loss^{< 0} = \log(1+\exp{(z)}) - zy

then, since z < 0 here, we use Loss^{< 0}. We will have neither to calculate p nor to evaluate \exp{(+305.8)}, only \exp{(-305.8)} \approx 0, which won’t overflow.

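As for the algebra, factor \exp{(-z)} out of the argument of the logarithm:

\log(1+\exp{(-z)}) = \log(\exp{(-z)}\times(1+\exp{(z)})) = -z + \log{(1+\exp{(z)})}

Adding -zy + z to both sides cancels the z on the right and gives the equation you asked about.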

Cheers,
Raymond

futurejj · August 18, 2023, 3:45am

I have one query: shouldn’t using logits be the default behaviour of the softmax output in TensorFlow? That is, if using softmax leads to numerical inaccuracies, in which scenario is it preferred over the logits way?

rmwkwok · August 18, 2023, 4:14am

I cannot think of a case where logits is not preferred.

TensorFlow is a flexible framework that allows you to go the logits way or not. Note that from_logits is False unless you set it, so the logits way has to be chosen explicitly.

However, logits is my default.

Cheers,
Raymond
