In the video, Andrew mentioned that the reason we use
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
is to avoid the very large and very small numbers produced by e to the power of z. With this setting, the output layer uses a linear activation instead of softmax, so the model outputs the logits z1-z10 (the linear layer's a, i.e. wx+b) instead of the probabilities a1-a10.

But my confusion is: why, at the end of the C2_W2_SoftMax lab, do we call the softmax function to calculate the probabilities?
sm_preferred = tf.nn.softmax(p_preferred).numpy()

Isn’t this against the purpose of the numerically accurate implementation? It still performs the calculation of e to the power of z, so how does this avoid the large and small numbers?
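For what it's worth, the overflow being asked about is easy to reproduce with plain Python floats, and library softmax implementations (including tf.nn.softmax) typically avoid it by subtracting the maximum logit before exponentiating, which leaves the result mathematically unchanged. A minimal sketch in plain Python (the helper names are my own, not from the lab):

```python
import math

def naive_softmax(z):
    # Direct translation of the formula: overflows for large z.
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(z):
    # Shift by max(z): every exponent argument is <= 0, so e^{z_i - m}
    # is at most 1 and cannot overflow. The shift cancels in the ratio,
    # so the probabilities are mathematically identical.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

try:
    naive_softmax([1000.0, 1000.0])   # math.exp(1000) overflows
except OverflowError:
    print("naive version overflowed")

print(stable_softmax([1000.0, 1000.0]))  # [0.5, 0.5]
```

So calling softmax once at inference is not the dangerous part, provided the implementation uses this kind of shift internally.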

So you know that the model does not produce probabilities.

Just because we want the probability values and the model does not produce them.
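One way to see why softmax is only called when probability values are wanted: softmax is strictly increasing in each logit, so the ranking of the outputs, and therefore the predicted class, is the same with or without it. A small pure-Python sketch (helper names are mine):

```python
import math

def softmax(z):
    m = max(z)  # shift by the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

logits = [2.0, -1.0, 0.5, 3.5]
probs = softmax(logits)

# The predicted class is identical either way; softmax is only
# needed when you want calibrated probability values.
assert argmax(logits) == argmax(probs) == 3
print(probs)
```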

You need to pay more attention to what problem this approach is trying to avoid, but putting that aside for now: the approach avoids the problem during the training stage. Otherwise, any error caused by the problem would be propagated back through the network during weight updates.

However, this matters for network training only; we don’t need to worry about inference time, because at inference no error is propagated back.

Hi Raymond, thx for the update. So basically, this approach keeps the small round-off errors from adding up during network training? And since in the end we use z1-z10 just once to calculate the probabilities, this won’t affect much?

Yup, @Bio_J: some of the very small round-off error, and, when -z becomes large, the overflow problem in e^{-z}.

I said “some” of the very small round-off error because, as I explained in this post, the “numerically accurate implementation” gives us a mathematical simplification from
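(The quoted post is cut off here.) For context, the standard simplification behind a from_logits-style loss is the log-sum-exp identity: -log softmax(z)_y = log(Σ_j e^{z_j}) - z_y, where the sum can itself be shifted by m = max(z). The loss is then computed directly from the logits, without ever forming the probabilities and taking their log. A sketch of both forms in pure Python (helper names are mine, not TensorFlow's):

```python
import math

def logsumexp(z):
    # log(sum_j e^{z_j}) via the max-shift identity:
    # log(sum e^{z_j}) = m + log(sum e^{z_j - m}), with m = max(z).
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def ce_from_probs(z, y):
    # "Two-step" loss: softmax first, then log. Breaks down for large
    # logit gaps: the probability underflows to 0 and log(0) fails.
    exps = [math.exp(v - max(z)) for v in z]
    p = exps[y] / sum(exps)
    return -math.log(p)

def ce_from_logits(z, y):
    # from_logits-style loss: -log softmax(z)_y = logsumexp(z) - z_y.
    # No probability is ever formed, so nothing can underflow to 0.
    return logsumexp(z) - z[y]

z = [2.0, -1.0, 0.5]
assert abs(ce_from_probs(z, 0) - ce_from_logits(z, 0)) < 1e-12

# With a large gap, the two-step version fails while the
# simplified form stays exact:
big = [0.0, 1000.0]
print(ce_from_logits(big, 0))  # 1000.0
try:
    ce_from_probs(big, 0)      # p underflows to 0.0 -> math.log(0) fails
except ValueError:
    print("two-step version failed")
```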