Why is `from_logits=True` plus `activation=linear` more stable?

From C2 W2 “Improved implementation of softmax”, we know that, for a binary classification problem, the following Approach A is more numerically stable than Approach B:

| Approach | Output layer's activation | Loss function |
|----------|---------------------------|---------------|
| A | `"linear"` | `tf.keras.losses.BinaryCrossentropy(from_logits=True)` |
| B | `"sigmoid"` | `tf.keras.losses.BinaryCrossentropy(from_logits=False)` |
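In Keras terms, the two approaches look like this (a minimal sketch; the hidden layer size and optimizer are placeholders of my own, not from the lecture):

```python
import tensorflow as tf

# Approach A: linear output layer, loss applied to the raw logit z (more stable)
model_a = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="linear"),  # outputs the logit z
])
model_a.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
)

# Approach B: sigmoid output layer, loss applied to the probability a (less stable)
model_b = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # outputs the probability a
])
model_b.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
)
```

Note that with Approach A, the model outputs logits, so at prediction time you apply `tf.math.sigmoid` to the output yourself to get probabilities.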

This post will show the mathematical reason, beginning with the following slide:
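The key equations on the slide are the standard binary cross-entropy forms. The middle one (Approach B) first computes the probability a and then plugs it into the loss; the bottom one (Approach A) substitutes the sigmoid directly into the loss:

$$
L = -y\log(a) - (1-y)\log(1-a), \qquad a = \frac{1}{1+e^{-z}}
$$

$$
L = -y\log\frac{1}{1+e^{-z}} - (1-y)\log\!\left(1-\frac{1}{1+e^{-z}}\right)
$$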

The lecture replaces the middle equation (Approach B) with the bottom one (Approach A). With the bottom one, we never explicitly calculate any probability value a. However, one might argue otherwise: if we computed e^{-z}, then added 1, then took the reciprocal, we would de facto be computing a, wouldn't we?

In fact, we are not. The bottom equation is only the starting point for a series of mathematical simplifications, after which we can see why not. Here come the steps:
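Writing the simplification out (the first step uses 1 - 1/(1+e^{-z}) = e^{-z}/(1+e^{-z})):

$$
\begin{aligned}
L &= -y\log\frac{1}{1+e^{-z}} - (1-y)\log\frac{e^{-z}}{1+e^{-z}} \\
  &= y\log\left(1+e^{-z}\right) - (1-y)\left[\log e^{-z} - \log\left(1+e^{-z}\right)\right] \\
  &= y\log\left(1+e^{-z}\right) + (1-y)\,z + (1-y)\log\left(1+e^{-z}\right) \\
  &= (1-y)\,z + \log\left(1+e^{-z}\right)
\end{aligned}
$$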

In the log e^{-z} term above, the log cancels out the e, so that log(e^{-z}) = -z. This is one of the two cores of the simplification, because e^{-z} easily overflows when -z is large. Think about what e^{-z} is when -z = 10000.

The chance that it can overflow is what makes it unstable.
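To see the overflow concretely, here is a quick NumPy check (NumPy is my choice here, purely for illustration):

```python
import numpy as np

z = -10000.0

# e^{-z} = e^{10000} is far beyond the largest float64 (~1.8e308),
# so it overflows to inf (NumPy also emits a RuntimeWarning)
print(np.exp(-z))  # inf

# e^{z} = e^{-10000}, by contrast, merely underflows harmlessly to 0.0
print(np.exp(z))   # 0.0
```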

With those e^{-z} terms gone, we are one step closer to a more stable form.

However, one e^{-z} still remains, inside log(1 + e^{-z}), and it can overflow when -z is large. Next, we will deal with that.
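Writing out the rewrite of that remaining term (factor e^{-z} out inside the log):

$$
\begin{aligned}
L &= (1-y)\,z + \log\left(1+e^{-z}\right) \\
  &= (1-y)\,z + \log\left(e^{-z}\left(e^{z}+1\right)\right) \\
  &= (1-y)\,z - z + \log\left(1+e^{z}\right) \\
  &= -y\,z + \log\left(1+e^{z}\right)
\end{aligned}
$$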

To begin with, note that e^{-z} can be unstable only when z < 0, so with the above maths we end up with a stable e^{z}. Why stable? Because z < 0, so e^{z} can only be very small or approach 0, and thus cannot overflow.

For the case of z ≥ 0, the original form is already stable, so we keep it that way. To list the two cases out:
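Expanding (1-y)z as z - yz to make the resemblance obvious:

$$
\begin{aligned}
z \ge 0:&\quad L = z - y\,z + \log\left(1+e^{-z}\right) \\
z < 0:&\quad L = 0 - y\,z + \log\left(1+e^{z}\right)
\end{aligned}
$$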

They look similar, right? In fact, we can combine them:
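$$
L = \max(0, z) - y\,z + \log\left(1+e^{-|z|}\right)
$$

This is the same numerically stable form documented for `tf.nn.sigmoid_cross_entropy_with_logits`.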

If you doubt the above combined formula, test it with some positive and negative values of z and compare the results with the separate version, as in the sketch below.
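Here is one such check in NumPy (a quick sketch; the helper names are my own):

```python
import numpy as np

def loss_separate(z, y):
    # two-case version: each branch keeps its exponent non-positive
    if z >= 0:
        return z - y * z + np.log(1 + np.exp(-z))
    return -y * z + np.log(1 + np.exp(z))

def loss_combined(z, y):
    # single combined formula: max(0, z) - y*z + log(1 + e^{-|z|})
    return max(0.0, z) - y * z + np.log(1 + np.exp(-abs(z)))

for z in [-1000.0, -5.0, -0.5, 0.0, 0.5, 5.0, 1000.0]:
    for y in [0.0, 1.0]:
        assert np.isclose(loss_separate(z, y), loss_combined(z, y))

print("combined formula matches the separate version")
```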

Note that max(0, z) means taking the larger of 0 and z. In fact, max(0, z) is just another way to write the ReLU function.

|z| means taking the absolute value of z. Since -|z| ≤ 0, e^{-|z|} always lies in (0, 1], so with e^{-|z|} as the only exponential term, the formula cannot overflow.

I hope that, by now, you can see how the maths simplifies the original loss function into a form that cannot overflow and is therefore more stable.

Cheers,
Raymond


Perfect explanation @rmwkwok!


Thanks, @Honza_Zbirovsky!