Model Output with and without Softmax Activation / from_logits=True

In the lab “C2_W2_SoftMax”, when I run the neural network in TensorFlow with the softmax activation function in the output layer, I get the following array as the output for a test example:
[5.06e-03 4.02e-03 9.65e-01 2.63e-02]

But when I change the activation function of the final output layer from softmax to linear and pass “loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)” to model.compile, I get the following array as the output for the same test example:
[-2.44 -3.39 2.61 -1.35]
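For reference, this is roughly how I set up the two versions (the hidden-layer size and optimizer here are placeholders, not the exact lab code):

import tensorflow as tf

# Version 1: softmax in the output layer, so the model outputs probabilities a1..a4
model_softmax = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation='relu'),
    tf.keras.layers.Dense(4, activation='softmax'),
])
model_softmax.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.001),
)

# Version 2: linear output layer, so the model outputs raw logits z1..z4,
# and the loss applies softmax internally because of from_logits=True
model_linear = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation='relu'),
    tf.keras.layers.Dense(4, activation='linear'),
])
model_linear.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(0.001),
)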

Is it right to say that the first array ([5.06e-03 4.02e-03 9.65e-01 2.63e-02]) contains the probabilities a1-a4, and the second array ([-2.44 -3.39 2.61 -1.35]) contains the values of “z”, z1-z4?

Your intuition is right, but in the case of linear, a is equal to z (a = z). So, with linear, a1 = z1, …, a4 = z4. This is not the case with softmax, where a = g(z).

Does this mean that a1 is a normalization of z1, and the same for a2-a4?

What do you mean by “normalization”?

In the case of linear, a is exactly the same as z. For example, if z = 5, a is also 5. However, in the case of softmax, or any other non-linear activation function, z and a generally have different values.
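As a quick illustration (just a sketch; the z values above are rounded, so the result won’t exactly match the first array you printed, but it has the same character):

import tensorflow as tf

z = tf.constant([-2.44, -3.39, 2.61, -1.35])

# Linear "activation": a is just z itself
a_linear = z

# Softmax activation: a = g(z), a probability distribution over the 4 classes
a_softmax = tf.nn.softmax(z).numpy()
print(a_softmax)        # approximately [0.0062 0.0024 0.9728 0.0185]
print(a_softmax.sum())  # ~1.0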

Best,
Saif.

Understood what you said. By normalization, I mean something like when we standardize data (e.g., using the mean and standard deviation) so that models train faster. In the same way, the values of “z” ([-2.44 -3.39 2.61 -1.35]) are normalized into the values ([5.06e-03 4.02e-03 9.65e-01 2.63e-02]), which are between 0 and 1.

Yes, softmax does normalize ‘z’ into probabilities. However, keep in mind that its primary objective is not general data normalization or standardization; it serves a more specific purpose: producing a probability distribution over the output classes.
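Concretely, for N classes it maps the logits to

a_j = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}}, \qquad 0 < a_j < 1, \qquad \sum_{j=1}^{N} a_j = 1

so the outputs are guaranteed to form a valid probability distribution, whatever the range of the logits.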

Does this mean that the largest value of “z” can determine the category without using the “tf.nn.softmax(p_preferred).numpy()” function to convert “z” to “a”? (as demonstrated at the end of the lab)

{Updated} This reply of mine is incorrect: I don’t think so. Linear activation output does not give any meaningful probabilities. For classification problems, we should use softmax or sigmoid in the output layer.

Correct reply: You are right, @Ammar_Jawed! Let’s say we are using sigmoid, a(z) = 1 / (1 + e^{-z}). When the input value z is a large number, such as 10, e^{-z} becomes very small (about 0.000045), making the denominator approximately equal to 1 and the final output approximately 1. So a large value of z gives us a probability close to 1. On the other hand, when z is a small number, such as -10, e^{-z} becomes very large (about 22026.46), making the denominator very large and the final output close to 0. So a small value of z gives us a probability close to 0.
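A quick numeric check of those values (a sketch in plain Python):

import math

def sigmoid(z):
    # a = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + math.exp(-z))

print(math.exp(-10))              # ~4.54e-05 -> denominator ~1     -> sigmoid(10)  ~0.99995
print(math.exp(10))               # ~22026.47 -> denominator ~22027 -> sigmoid(-10) ~4.54e-05
print(sigmoid(10), sigmoid(-10))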

What I think is this: let’s say you chose 2.61 as the largest value of ‘z’. Now ask yourself: why did you choose that number? Is it because it is the ‘largest’?

I think the term ‘largest’, or the number 2.61, does not have much meaning until you transform it into a probability. After that transformation, the number 2.61 can be interpreted as a probability of 9.65e-01, or a 96.5% likelihood, which is the highest probability among the categories. So why not use tf.nn.softmax() to make it meaningful?

Certainly, it’s better practice to use tf.nn.softmax() to check the probabilities, as it gives better intuition about the solution than checking the values of “z”, but wouldn’t it be an extra step that requires extra computation?

When training large neural networks, is it good practice to first compute the probabilities and then choose the category, or to directly choose the category without computing the probabilities, as demonstrated in the lab?

Using softmax to compute probabilities is good practice. There is always a trade-off: the extra computation buys you interpretable probabilities rather than raw logits.
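To make the trade-off concrete: softmax is strictly increasing, so taking argmax over the logits and over the probabilities picks the same category; the extra softmax call only buys you the probability values themselves. A small sketch:

import numpy as np
import tensorflow as tf

z = np.array([-2.44, -3.39, 2.61, -1.35])   # logits from the linear output layer
a = tf.nn.softmax(z).numpy()                # probabilities

print(np.argmax(z), np.argmax(a))           # same index (2) either way
print(a[np.argmax(a)])                      # but only 'a' tells you how confident (~0.97)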

I think this lecture has also discussed your question!