Purpose of using the numerically accurate implementation of softmax

In the video, Andrew mentioned that the reason we use
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
is to avoid the very large and very small numbers that come from raising e to the power of z. With that setting, the output layer uses a linear activation function instead of softmax, so the model outputs z1–z10 (w·x + b, the linear layer’s a) instead of a1–a10.
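
For context, this is roughly the setup being discussed (a sketch from memory, not the exact lab code; the hidden-layer sizes are just illustrative): the output layer uses a linear activation, and the loss is told that the model outputs logits.

```python
import tensorflow as tf

# Sketch of the "preferred" setup: the last Dense layer is linear, so the
# model outputs logits z1..z10 rather than softmax probabilities a1..a10.
# (Hidden-layer sizes here are illustrative, not the lab's exact values.)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation='relu'),
    tf.keras.layers.Dense(15, activation='relu'),
    tf.keras.layers.Dense(10, activation='linear'),   # logits, not probabilities
])

# from_logits=True lets the loss combine softmax and cross-entropy internally,
# in a numerically more stable way, instead of expecting probabilities.
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```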

But my confusion is: why, at the end of the lab C2_W2_SoftMax, do we call the softmax function to calculate the probabilities?
sm_preferred = tf.nn.softmax(p_preferred).numpy()

Isn’t this against the purpose of the numerically accurate implementation? Since it still computes e to the power of z, how does this avoid the very large and very small numbers?

Can you say where you heard this reason? Because I don’t believe it is the correct explanation.

Hello @Bio_J,

So you know that the model does not produce probabilities.

We call softmax at the end simply because we want the probability values, and the model itself does not produce them.

You need to pay more attention to what problem this approach is trying to avoid, but putting that aside for now, the fact is that the approach avoids the problem during the training stage; otherwise, any error caused by the problem would be propagated back through the network during the weight updates.

However, this is for the sake of network training only, and we don’t worry about inference, because at inference time no error is propagated back.
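
To make that concrete, here is a small sketch of the inference step (the model and data names are placeholders, not the exact lab code). The softmax is applied once, on the forward pass only, so nothing here flows back into the weights.

```python
import tensorflow as tf

# Inference only: run the trained model to get logits, then convert them to
# probabilities once. No gradients are computed here, so any tiny numerical
# error in this single softmax never reaches the weights.
p_preferred = model.predict(X_test)                  # logits z1..z10 per example
sm_preferred = tf.nn.softmax(p_preferred).numpy()    # probabilities a1..a10

# The predicted class is the same whether you argmax the logits or the
# probabilities, because softmax preserves the ordering of the logits.
predicted_class = sm_preferred.argmax(axis=1)
```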

Cheers,
Raymond



In the Week 2 video “Improved implementation of softmax”.

Hi Raymond, thanks for the update. So basically, this approach is trying to keep the small round-off errors from adding up during network training? And since in the end we only use z1–z10 once to calculate the probabilities, this won’t affect much?


Yup, @Bio_J, some of the very small round-off errors, and, when -z becomes large, the overflow problem in e^{-z}.

I said some of the very small round-off errors because, as I explained in this post, the “numerically accurate implementation” gives us a mathematical simplification from

$$-\log\left(\frac{1}{1+e^{-z}}\right)$$

to

$$\log\left(1+e^{-z}\right)$$

And we still have one exponential term left. :wink:
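
The same concern shows up on the softmax side: evaluated naively, the exponentials can overflow for large logits, which is why libraries rearrange the computation when they can. A tiny NumPy illustration (the shift-by-max trick shown here is one standard rearrangement, not necessarily TensorFlow’s exact code):

```python
import numpy as np

z = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)   # large logits

# Naive softmax: e^{1000} overflows to inf, and inf/inf produces nan.
naive = np.exp(z) / np.exp(z).sum()

# Shift-by-max: subtract max(z) first. The ratios are mathematically the
# same, but the exponentials now stay in a representable range.
shifted = z - z.max()
stable = np.exp(shifted) / np.exp(shifted).sum()

print(naive)    # [nan nan nan]
print(stable)   # [0.09003057 0.24472848 0.66524094]
```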

Cheers,
Raymond
