Why is `from_logits=True` plus `activation=linear` more stable?

rmwkwok · August 30, 2023, 2:32am

From C2 W2 “Improved implementation of softmax”, we know that, for a binary classification problem, Approach A below is more numerically stable than Approach B:

| Approach | Output layer’s activation | Loss function |
|---|---|---|
| A | “linear” | `tf.keras.losses.BinaryCrossentropy(from_logits=True)` |
| B | “sigmoid” | `tf.keras.losses.BinaryCrossentropy(from_logits=False)` |
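As a rough illustration of the two approaches, here is a minimal Keras sketch. The hidden layer size, the optimizer, and the helper name `predict_proba_a` are assumptions made for this example and are not taken from the course lab.

```python
import tensorflow as tf

# Approach A: linear output layer, so the model outputs the logit z.
# The loss is told that it receives logits (from_logits=True).
model_a = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation="relu"),   # hidden size is an assumption
    tf.keras.layers.Dense(1, activation="linear"),
])
model_a.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
)

# Approach B: sigmoid output layer, so the model outputs the probability a.
# The loss receives probabilities (from_logits=False).
model_b = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model_b.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
)


def predict_proba_a(x):
    """Hypothetical helper: turn Approach A's logits into probabilities."""
    return tf.nn.sigmoid(model_a(x))
```

With Approach A the sigmoid is only applied at prediction time, when a probability is actually needed.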

This post shows the mathematical reason, beginning with the following slide:

The lecture replaces the middle equation (Approach B) with the bottom one (Approach A). With the bottom one, we never explicitly calculate the probability value a. However, one might argue otherwise: if we computed e^{-z}, then added 1, then took the reciprocal, we would de facto be computing a, wouldn’t we?

The fact is, we are not. The bottom equation is just the first step in a series of mathematical simplifications, after which we can see why we are not. Here are the steps:
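A sketch of these steps, reconstructed from the surrounding text. It assumes the lecture's notation: y is the label, z is the output of the linear layer, and the starting point is the binary cross-entropy loss with the sigmoid substituted in.

```latex
\begin{aligned}
\text{loss}
&= -y \log\!\left(\frac{1}{1+e^{-z}}\right) - (1-y)\log\!\left(1-\frac{1}{1+e^{-z}}\right) \\
&= y \log\!\left(1+e^{-z}\right) - (1-y)\log\!\left(\frac{e^{-z}}{1+e^{-z}}\right) \\
&= y \log\!\left(1+e^{-z}\right) - (1-y)\left[\,\underline{\log\!\left(e^{-z}\right)} - \log\!\left(1+e^{-z}\right)\right] \\
&= y \log\!\left(1+e^{-z}\right) + (1-y)\,z + (1-y)\log\!\left(1+e^{-z}\right) \\
&= z - yz + \log\!\left(1+e^{-z}\right)
\end{aligned}
```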

In the underlined terms, log cancels out e, so that log(e^{-z}) = -z. This is one of the two core steps of this simplification, because e^{-z} overflows easily when -z is too large. Think about what it is when -z = 10000.

The chance that it can overflow makes it unstable.
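To make the overflow concrete, here is a small sketch using only Python's standard `math` module with the value -z = 10000 mentioned above. The variable names are chosen for this example.

```python
import math

neg_z = 10000.0  # the value of -z used as the example in the text

try:
    math.exp(neg_z)
    overflowed = False
except OverflowError:
    # e^10000 is far beyond the largest double (about 1.8e308),
    # so math.exp raises instead of returning a number.
    overflowed = True

print("exp(10000) overflowed:", overflowed)

# After the simplification log(e^{-z}) = -z, no exponential is evaluated.
simplified = neg_z
print("log(e^{-z}) after simplification:", simplified)
```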

With those e^{-z} gone, it is one step closer to a more stable form.

However, one e^{-z} still remains, and it can overflow when -z is large. Next, we will deal with that.
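One way to write this step for the case z < 0, as a sketch in the same notation (y is the label, z is the logit), is to factor e^{-z} out of the remaining logarithm:

```latex
\begin{aligned}
\log\!\left(1+e^{-z}\right)
&= \log\!\left(e^{-z}\left(e^{z}+1\right)\right)
 = -z + \log\!\left(1+e^{z}\right) \\
\text{loss}
&= z - yz + \log\!\left(1+e^{-z}\right)
 = -yz + \log\!\left(1+e^{z}\right) \qquad (z < 0)
\end{aligned}
```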

To begin with, note that e^{-z} can overflow only when z < 0. In that case, rewriting log(1 + e^{-z}) as -z + log(1 + e^{z}) leaves us with e^{z} instead, which is stable. Why stable? Because z < 0, e^{z} can only be very small or approach 0, so it cannot overflow.

For the case of z ≥ 0, the original form is already stable, so we keep it that way. To list them out:
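Written out as a sketch, with y as the label and z as the logit, the two cases are:

```latex
\text{loss} =
\begin{cases}
-yz + \log\!\left(1+e^{z}\right), & z < 0 \\[4pt]
z - yz + \log\!\left(1+e^{-z}\right), & z \ge 0
\end{cases}
```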

They look similar, right? In fact, we can combine them:
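A sketch of the combined form, with y as the label and z as the logit. It should match the stable expression documented for TensorFlow's `tf.nn.sigmoid_cross_entropy_with_logits`:

```latex
\text{loss} = \max(0, z) - yz + \log\!\left(1+e^{-|z|}\right)
```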

If you doubt the above combined formula, test it with some positive and negative values of z and compare the results with the separate version.
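Such a test might look like the sketch below. It assumes the separate version is -yz + log(1 + e^{z}) for z < 0 and z - yz + log(1 + e^{-z}) for z ≥ 0, and the combined version is max(0, z) - yz + log(1 + e^{-|z|}). The function names are made up for this example.

```python
import math


def loss_separate(z, y):
    """Piecewise form: one expression for z < 0, another for z >= 0."""
    if z < 0:
        return -y * z + math.log(1.0 + math.exp(z))
    return z - y * z + math.log(1.0 + math.exp(-z))


def loss_combined(z, y):
    """Single expression: max(0, z) - y*z + log(1 + e^{-|z|})."""
    return max(0.0, z) - y * z + math.log(1.0 + math.exp(-abs(z)))


# Compare the two versions on negative, zero, and positive logits.
max_diff = 0.0
for z in (-30.0, -2.5, -0.1, 0.0, 0.1, 2.5, 30.0):
    for y in (0, 1):
        diff = abs(loss_separate(z, y) - loss_combined(z, y))
        max_diff = max(max_diff, diff)

print("largest difference between the two versions:", max_diff)
```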

Note that max(0, z) means taking the larger of 0 and z. In fact, max(0, z) is just another way to write the ReLU function.

|z| means taking the absolute value of z. With e^{-|z|} as the only exponential term, it will not overflow.
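As a sketch of what this buys us, the example below compares a naive loss that computes a first (in the spirit of Approach B) with the combined form, using plain Python floats. The function names and the test values are assumptions for illustration. Keras itself may handle extreme inputs differently, for example by clipping probabilities.

```python
import math


def loss_naive(z, y):
    """Compute a = sigmoid(z) explicitly, then take logs of a."""
    a = 1.0 / (1.0 + math.exp(-z))
    return -y * math.log(a) - (1 - y) * math.log(1.0 - a)


def loss_stable(z, y):
    """Combined form: the only exponential is e^{-|z|}, which is at most 1."""
    return max(0.0, z) - y * z + math.log(1.0 + math.exp(-abs(z)))


# For a moderate logit both versions agree closely.
moderate_gap = abs(loss_naive(2.0, 1) - loss_stable(2.0, 1))
print("gap at z = 2:", moderate_gap)

# For a very negative logit, the naive version must evaluate e^{10000}.
z, y = -10000.0, 1
try:
    naive_result = loss_naive(z, y)
except OverflowError:
    naive_result = None  # math.exp(10000) is out of range for a double

stable_result = loss_stable(z, y)
print("naive:", naive_result, "stable:", stable_result)
```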

I hope that, by now, you can see how the maths simplifies the original loss function to a form that cannot overflow and is more stable.

Cheers,
Raymond

Honza_Zbirovsky · December 18, 2023, 12:16pm

Perfect explanation @rmwkwok!

rmwkwok · December 18, 2023, 2:00pm

Thanks, @Honza_Zbirovsky!
