Purpose of using the numerically accurate implementation of softmax

In the video, Andrew mentioned that the reason we use
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
is to avoid the very large and very small numbers that come from raising e to the power of z. With that setting, the output layer uses a linear activation function instead of softmax, so the model outputs z1–z10 (w·x + b, the linear layer’s a) instead of a1–a10.
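
For context, this is roughly the setup being discussed (a sketch from memory, not the exact lab code; the hidden-layer sizes are just illustrative): the output layer uses a linear activation, and the loss is told that the model outputs logits.

```python
import tensorflow as tf

# Sketch of the "preferred" setup: the last Dense layer is linear, so the
# model outputs logits z1..z10 rather than softmax probabilities a1..a10.
# (Hidden-layer sizes here are illustrative, not the lab's exact values.)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation='relu'),
    tf.keras.layers.Dense(15, activation='relu'),
    tf.keras.layers.Dense(10, activation='linear'),   # logits, not probabilities
])

# from_logits=True lets the loss combine softmax and cross-entropy internally,
# in a numerically more stable way, instead of expecting probabilities.
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```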

But my confusion is: why, at the end of the lab C2_W2_SoftMax, do we call the softmax function to calculate the probabilities?
sm_preferred = tf.nn.softmax(p_preferred).numpy()

Isn’t this against the purpose of the numerically accurate implementation? Since it still computes e to the power of z, how does this avoid the very large and very small numbers?

Can you say where you heard this reason? Because I don’t believe it is the correct explanation.

Hello @Bio_J,

So you know that the model does not produce probabilities.

We call softmax at the end simply because we want the probability values, and the model itself does not produce them.

You need to pay more attention to what problem this approach is trying to avoid, but putting that aside for now, the fact is that the approach avoids the problem during the training stage; otherwise, any error caused by the problem would be propagated back through the network during the weight updates.

However, this is for the sake of network training only, and we don’t worry about inference, because at inference time no error is propagated back.
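
To make that concrete, here is a small sketch of the inference step (the model and data names are placeholders, not the exact lab code). The softmax is applied once, on the forward pass only, so nothing here flows back into the weights.

```python
import tensorflow as tf

# Inference only: run the trained model to get logits, then convert them to
# probabilities once. No gradients are computed here, so any tiny numerical
# error in this single softmax never reaches the weights.
p_preferred = model.predict(X_test)                  # logits z1..z10 per example
sm_preferred = tf.nn.softmax(p_preferred).numpy()    # probabilities a1..a10

# The predicted class is the same whether you argmax the logits or the
# probabilities, because softmax preserves the ordering of the logits.
predicted_class = sm_preferred.argmax(axis=1)
```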

Cheers,
Raymond



In the Week 2 video “Improved implementation of softmax”.

Hi Raymond, thanks for the update. So basically, this approach is trying to keep the small round-off errors from adding up during network training? And since in the end we only use z1–z10 once to calculate the probabilities, this won’t affect much?


Yup, @Bio_J, some of the very small round-off errors, and, when -z becomes large, the overflow problem in e^{-z}.

I said some of the very small round-off errors because, as I explained in this post, the “numerically accurate implementation” gives us a mathematical simplification from

$$-\log\left(\frac{1}{1+e^{-z}}\right)$$

to

$$\log\left(1+e^{-z}\right)$$

And we still have one exponential term left. :wink:
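
The same concern shows up on the softmax side: evaluated naively, the exponentials can overflow for large logits, which is why libraries rearrange the computation when they can. A tiny NumPy illustration (the shift-by-max trick shown here is one standard rearrangement, not necessarily TensorFlow’s exact code):

```python
import numpy as np

z = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)   # large logits

# Naive softmax: e^{1000} overflows to inf, and inf/inf produces nan.
naive = np.exp(z) / np.exp(z).sum()

# Shift-by-max: subtract max(z) first. The ratios are mathematically the
# same, but the exponentials now stay in a representable range.
shifted = z - z.max()
stable = np.exp(shifted) / np.exp(shifted).sum()

print(naive)    # [nan nan nan]
print(stable)   # [0.09003057 0.24472848 0.66524094]
```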

Cheers,
Raymond
