Why is `from_logits=True` plus `activation=linear` more stable?

From C2 W2 “Improved implementation of softmax”, we know that, for a binary classification problem, approach A below is more numerically stable than approach B:

| Approach | Output layer’s activation | Loss function |
|---|---|---|
| A | “linear” | `tf.keras.losses.BinaryCrossentropy(from_logits=True)` |
| B | “sigmoid” | `tf.keras.losses.BinaryCrossentropy(from_logits=False)` |

This post shows the mathematical reason, beginning with the following slide:
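For reference, the three equations in question can be written as below, from top to bottom. This is a reconstruction consistent with the discussion that follows, so the slide's exact notation may differ; $a$ is the sigmoid output, $z$ the logit and $y$ the label.

$$
a = g(z) = \frac{1}{1+e^{-z}}
$$

$$
\text{loss} = -y\log(a) - (1-y)\log(1-a)
$$

$$
\text{loss} = -y\log\left(\frac{1}{1+e^{-z}}\right) - (1-y)\log\left(1-\frac{1}{1+e^{-z}}\right)
$$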

The lecture replaces the middle equation (Approach B) with the bottom one (Approach A). With the bottom one, we never explicitly calculate the probability value $a$. However, one might argue otherwise: if we compute $e^{-z}$, add 1, and then take the reciprocal, aren't we de facto computing $a$?

In fact, we are not. The bottom equation is only the starting point for a series of mathematical simplifications, after which it becomes clear why. Here are the steps:
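Written out as a worked derivation, the simplification can go as follows. This is a sketch assuming the natural log and the loss $-y\log(a) - (1-y)\log(1-a)$ with $a = 1/(1+e^{-z})$; the exact intermediate steps may be arranged differently elsewhere, and the underlines mark the terms where the log meets the exponential.

$$
\begin{aligned}
\text{loss} &= -y\log\left(\frac{1}{1+e^{-z}}\right) - (1-y)\log\left(1-\frac{1}{1+e^{-z}}\right) \\
&= y\log(1+e^{-z}) - (1-y)\log\left(\frac{e^{-z}}{1+e^{-z}}\right) \\
&= y\log(1+e^{-z}) - \underline{\log(e^{-z})} + y\,\underline{\log(e^{-z})} + (1-y)\log(1+e^{-z}) \\
&= y\log(1+e^{-z}) + z - yz + (1-y)\log(1+e^{-z}) \\
&= z - yz + \log(1+e^{-z})
\end{aligned}
$$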

In the underlined terms, the log cancels the exponential, so that $\log(e^{-z}) = -z$. This is one of the two core steps of the simplification, because $e^{-z}$ overflows easily when $-z$ is large. Think about what it is when $-z = 10000$.

That possibility of overflow is what makes the computation unstable.
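To make the overflow concrete, here is a minimal NumPy sketch. It is only an illustration in 64-bit floats, not how Keras computes the loss internally, and `naive_bce` is a hypothetical helper that evaluates the loss by first computing the sigmoid output.

```python
import numpy as np

def naive_bce(z, y):
    """Compute a = sigmoid(z) first, then plug it into the loss."""
    # Silence the floating-point warnings so the bad values can be inspected.
    with np.errstate(over="ignore", divide="ignore", invalid="ignore"):
        a = 1.0 / (1.0 + np.exp(-z))
        return -y * np.log(a) - (1.0 - y) * np.log(1.0 - a)

z = np.float64(-10000.0)

with np.errstate(over="ignore"):
    overflowed = np.exp(-z)  # e^{10000} does not fit in a float64

print(overflowed)         # the exponential itself
print(naive_bce(z, 1.0))  # the loss computed through a
```

The exact loss for this input is about 10000, but the naive route cannot represent the intermediate $e^{-z}$, so the result is no longer a usable number.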

With those $e^{-z}$ terms gone, the loss is one step closer to a stable form.

However, we still have one $e^{-z}$ remaining, which can overflow when $-z$ is large. Next, we will deal with that.
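As a sketch of that step, start from the simplified loss $z - yz + \log(1+e^{-z})$ and fold the leading $z$ into the log term using $z = \log(e^{z})$. This swaps $e^{-z}$ for $e^{z}$:

$$
\begin{aligned}
z - yz + \log(1+e^{-z}) &= -yz + \log(e^{z}) + \log(1+e^{-z}) \\
&= -yz + \log\left(e^{z}\,(1+e^{-z})\right) \\
&= -yz + \log(1+e^{z})
\end{aligned}
$$

The identity holds for any $z$, but it is only needed for $z < 0$.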

To begin with, note that $e^{-z}$ can overflow only when $z < 0$, so with the above maths we end up with a stable $e^{z}$ instead. Why stable? Because $z < 0$, $e^{z}$ can only be very small or approach 0, so it cannot overflow.

For the case of $z \ge 0$, the original form is already stable because $e^{-z} \le 1$, so we keep it as it is. Listing the two cases:
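In symbols, assuming the simplified loss $z - yz + \log(1+e^{-z})$ as the starting point, the two cases read:

$$
\text{loss} =
\begin{cases}
z - yz + \log(1+e^{-z}), & z \ge 0 \\
-yz + \log(1+e^{z}), & z < 0
\end{cases}
$$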

They look similar, right? In fact, we can combine them:
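One way to write the combined form, with $\max(0, z)$ supplying the leading $z$ only when $z \ge 0$ and $-|z|$ choosing the safe exponent in both cases:

$$
\text{loss} = \max(0, z) - yz + \log\left(1+e^{-|z|}\right)
$$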

If you doubt the above combined formula, test it with some positive z's and negative z's and compare the results with the separate version :wink:
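Such a check could look like the sketch below. The function names `loss_separate` and `loss_combined` are made up for this example, and the formulas are the ones described in this post: $z - yz + \log(1+e^{-z})$ for $z \ge 0$, $-yz + \log(1+e^{z})$ for $z < 0$, and $\max(0, z) - yz + \log(1+e^{-|z|})$ for the combined form.

```python
import numpy as np

def loss_separate(z, y):
    """Piecewise version: one formula for z >= 0, another for z < 0."""
    if z >= 0:
        return z - y * z + np.log(1.0 + np.exp(-z))
    return -y * z + np.log(1.0 + np.exp(z))

def loss_combined(z, y):
    """Single formula using max(0, z) and |z|."""
    return max(0.0, z) - y * z + np.log(1.0 + np.exp(-abs(z)))

# Try a few negative, zero and positive logits with both labels.
for z in (-5.0, -0.5, 0.0, 0.5, 5.0):
    for y in (0.0, 1.0):
        assert np.isclose(loss_separate(z, y), loss_combined(z, y))

print("combined and separate versions agree")
```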

Note that $\max(0, z)$ takes the larger of 0 and $z$. In fact, $\max(0, z)$ is just another way to write the ReLU function.

$|z|$ is the absolute value of $z$. With $e^{-|z|}$ as the only exponential term, the exponent is never positive, so it cannot overflow.
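As a rough illustration, the formula $\max(0, z) - yz + \log(1+e^{-|z|})$ can be evaluated at extreme logits with NumPy. This is a sketch, not the actual Keras code path: `stable_bce_from_logits` is a hypothetical name, and `np.log1p` is used in place of `log(1 + ...)` as a small accuracy choice of mine. The expression itself matches the one given in the TensorFlow documentation for `tf.nn.sigmoid_cross_entropy_with_logits`.

```python
import numpy as np

def stable_bce_from_logits(z, y):
    """max(0, z) - y*z + log(1 + e^{-|z|}); the exponent is never positive."""
    z = np.asarray(z, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return np.maximum(0.0, z) - y * z + np.log1p(np.exp(-np.abs(z)))

# Confidently wrong predictions at both extremes, plus some moderate logits.
z = np.array([-10000.0, -2.0, 0.0, 2.0, 10000.0])
y = np.array([1.0, 1.0, 1.0, 0.0, 0.0])

losses = stable_bce_from_logits(z, y)
print(losses)
```

Every entry stays finite, including the two at $|z| = 10000$, where $e^{-|z|}$ simply underflows to 0 and the loss reduces to $|z|$.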

Hope that, by now, you see how the maths simplify the original loss function to a form that cannot overflow and is more stable.

Cheers,
Raymond

Perfect explanation @rmwkwok!

Thanks, @Honza_Zbirovsky!