Model Output with and without Softmax Activation / from_logits=True

In the lab “C2_W2_SoftMax”, when I run the neural network in TensorFlow with the softmax activation function in the output layer, I get the following array as the output for a test example:
[5.06e-03 4.02e-03 9.65e-01 2.63e-02]

But when I change the activation function of the final output layer from softmax to linear and pass “loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)” to model.compile, I get the following array as the output for the same test example:
[-2.44 -3.39 2.61 -1.35]
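For reference, this is roughly how I set up the two versions (the hidden-layer size and optimizer here are placeholders, not the exact lab code):

import tensorflow as tf

# Version 1: softmax in the output layer, so the model outputs probabilities a1..a4
model_softmax = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation='relu'),
    tf.keras.layers.Dense(4, activation='softmax'),
])
model_softmax.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.001),
)

# Version 2: linear output layer, so the model outputs raw logits z1..z4,
# and the loss applies softmax internally because of from_logits=True
model_linear = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation='relu'),
    tf.keras.layers.Dense(4, activation='linear'),
])
model_linear.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=tf.keras.optimizers.Adam(0.001),
)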

Is it right to say that the first array ([5.06e-03 4.02e-03 9.65e-01 2.63e-02]) contains the probabilities a1-a4, and the second array ([-2.44 -3.39 2.61 -1.35]) contains the values of “z”, z1-z4?

Your intuition is right, but in the case of linear, a is equal to z (a = z). So, with linear, a1 = z1, …, a4 = z4. This is not the case with softmax, where a = g(z).

Does this mean that a1 is a normalization of z1, and the same for a2-a4?

What do you mean by “normalization”?

In the case of linear, a is exactly the same as z. For example, if z = 5, a is also 5. However, in the case of softmax, or any other non-linear activation function, z and a generally have different values.
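As a quick illustration (just a sketch; the z values above are rounded, so the result won’t exactly match the first array you printed, but it has the same character):

import tensorflow as tf

z = tf.constant([-2.44, -3.39, 2.61, -1.35])

# Linear "activation": a is just z itself
a_linear = z

# Softmax activation: a = g(z), a probability distribution over the 4 classes
a_softmax = tf.nn.softmax(z).numpy()
print(a_softmax)        # approximately [0.0062 0.0024 0.9728 0.0185]
print(a_softmax.sum())  # ~1.0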

Best,
Saif.

Understood what you said. By normalization, I mean something like when we standardize data (e.g., using the mean and standard deviation) so that models train faster. In the same way, the values of “z” ([-2.44 -3.39 2.61 -1.35]) are normalized into the values ([5.06e-03 4.02e-03 9.65e-01 2.63e-02]), which are between 0 and 1.

Yes, softmax does normalize ‘z’ into probabilities. However, keep in mind that its primary objective is not general data normalization or standardization; it serves a more specific purpose: producing a probability distribution over the output classes.
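Concretely, for N classes it maps the logits to

a_j = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}}, \qquad 0 < a_j < 1, \qquad \sum_{j=1}^{N} a_j = 1

so the outputs are guaranteed to form a valid probability distribution, whatever the range of the logits.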

Does this mean that the largest value of “z” can determine the category without using the “tf.nn.softmax(p_preferred).numpy()” function to convert “z” to “a”? (as demonstrated at the end of the lab)

{Updated} This reply of mine is incorrect: I don’t think so. Linear activation output does not give any meaningful probabilities. For classification problems, we should use softmax or sigmoid in the output layer.

Correct reply: You are right, @Ammar_Jawed! Let’s say we are using sigmoid, a(z) = 1 / (1 + e^{-z}). When the input value z is a large number, such as 10, e^{-z} becomes very small (about 0.000045), making the denominator approximately equal to 1 and the final output approximately 1. So a large value of z gives us a probability close to 1. On the other hand, when z is a small number, such as -10, e^{-z} becomes very large (about 22026.46), making the denominator very large and the final output close to 0. So a small value of z gives us a probability close to 0.
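A quick numeric check of those values (a sketch in plain Python):

import math

def sigmoid(z):
    # a = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + math.exp(-z))

print(math.exp(-10))              # ~4.54e-05 -> denominator ~1     -> sigmoid(10)  ~0.99995
print(math.exp(10))               # ~22026.47 -> denominator ~22027 -> sigmoid(-10) ~4.54e-05
print(sigmoid(10), sigmoid(-10))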

What I think is this: let’s say you chose 2.61 as the largest value of ‘z’. Now ask yourself: why did you choose that number? Is it because it is the ‘largest’?

I think the term ‘largest’, or the number 2.61, does not have much meaning until you transform it into a probability. After that transformation, the number 2.61 can be interpreted as a probability of 9.65e-01, or a 96.5% likelihood, which is the highest probability among the categories. So why not use tf.nn.softmax() to make it meaningful?

Certainly, it’s better practice to use tf.nn.softmax() to check the probabilities, as it gives better intuition about the solution than checking the values of “z”, but wouldn’t it be an extra step that requires extra computation?

When training large neural networks, is it good practice to first compute the probabilities and then choose the category, or to directly choose the category without computing the probabilities, as demonstrated in the lab?

Using softmax to compute probabilities is good practice. There is always a trade-off: the extra computation buys you interpretable probabilities rather than raw logits.
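To make the trade-off concrete: softmax is strictly increasing, so taking argmax over the logits and over the probabilities picks the same category; the extra softmax call only buys you the probability values themselves. A small sketch:

import numpy as np
import tensorflow as tf

z = np.array([-2.44, -3.39, 2.61, -1.35])   # logits from the linear output layer
a = tf.nn.softmax(z).numpy()                # probabilities

print(np.argmax(z), np.argmax(a))           # same index (2) either way
print(a[np.argmax(a)])                      # but only 'a' tells you how confident (~0.97)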

I think this lecture has also discussed your question!