Week 2 - Improved implementation with SoftMax

I am confused about why we used Sigmoid activation when from_logits = True is set; shouldn’t it be Softmax activation?


Hi @Yug_Desai_RA19110030,
Sigmoid is usually used for binary classification (only two classes). You use Softmax when you have more than 2 classes.
The idea is to use from_logits = True with Sigmoid for binary classification. For multiclass classification, Softmax is used, and Softmax gives a probability distribution over all the classes.
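To make that distinction concrete, here is a small numpy sketch (the helper functions are my own, not from the course code): sigmoid turns one logit into a single probability for the positive class, while softmax turns a vector of logits into a distribution over all the classes that sums to 1.

```python
import numpy as np

def sigmoid(z):
    # binary case: one logit -> P(class 1); P(class 0) is just 1 - P(class 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # multiclass case: one logit per class -> a distribution over all classes
    z = z - np.max(z)              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

z_binary = 0.8                     # a single logit from the last linear layer
p1 = sigmoid(z_binary)
print(f"P(class 1) = {p1:.4f}, P(class 0) = {1 - p1:.4f}")

z_multi = np.array([2.0, 1.0, 0.1])   # one logit per class
p = softmax(z_multi)
print(f"softmax probs = {p}, sum = {p.sum():.4f}")
```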


That clarifies the doubt, thank you


Just one further clarification here: the issue of from_logits is independent of whether we are doing binary or multiclass classification. In either type of classification, it makes more sense to use from_logits = True, which just means that whatever the activation function is (sigmoid or softmax) happens as a unified part of the cross entropy loss function calculations. Here’s a thread which explains why this mode is more advantageous. The “tl;dr” is that from_logits = True gives you answers that are closer to the real mathematically correct answers we would get if we could do the calculations using the real numbers \mathbb{R}, instead of just approximating things in floating point.
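To sketch what that fusion looks like for the binary case, here is a numpy version using the well-known stable identity for sigmoid cross entropy computed from logits, max(z, 0) - z*y + log(1 + e^{-|z|}); the function names are mine, not the actual framework internals:

```python
import numpy as np

def bce_from_probs(y, p):
    # loss computed after a separate sigmoid: needs log(p) and log(1 - p),
    # which blow up as p approaches 0 or 1
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bce_from_logits(y, z):
    # fused "from_logits" form: algebraically the same loss, but evaluated
    # without ever materializing sigmoid(z)
    return np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))

z = np.array([-3.0, 0.5, 4.0])     # raw linear outputs (logits)
y = np.array([0.0, 1.0, 1.0])      # true labels
p = 1.0 / (1.0 + np.exp(-z))       # separate sigmoid step
print(bce_from_probs(y, p))        # two-step version
print(bce_from_logits(y, z))       # fused version: agrees for moderate z,
                                   # but stays finite even for extreme z
```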


Based on my understanding, can I say, considering an equation:

1. When we don’t use from_logits, i.e. from_logits = False, the calculation for the equation is:

7 * 3.1428571429 = 22.0000000003

2. Whereas, when we use from_logits = True, the calculation for the equation becomes how we approach it in the real numbers, where we cancel the numerator against the denominator to get:

7 * (22/7) = 22 ?

The example you show in hand-coded python does show the inaccuracies that arise because of the finite nature of floating point representations. We literally cannot exactly represent even something as simple to express as \frac {1}{3}, so the answers are all approximations. What is happening with the difference between from_logits = True and from_logits = False is that in the True case, they get to choose computations that are better approximations of the cost, so that we end up with more accurate answers.

One other point to make here is that in order to use from_logits = True mode, the output of our network is the linear output of the last layer and does not include sigmoid or softmax. So when we use our network in “predict” mode, as opposed to training mode, we need to take the output and feed it through the appropriate activation to actually get the prediction answer.
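Here is a little numpy sketch of that predict-time step (the shapes and values are mine, just for illustration): the model emits raw linear outputs, and we apply softmax ourselves to turn them into probabilities. Note that if all you want is the hard class label, argmax over the logits already gives the same answer, since softmax is monotonic.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - np.max(z, axis=axis, keepdims=True)   # stable shift
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# pretend these are the raw linear outputs of the last layer for 2 examples
logits = np.array([[1.2, -0.3,  2.5],
                   [0.1,  0.9, -1.4]])

# training: the loss consumed `logits` directly (from_logits = True)
# predicting: we apply the activation ourselves, then read off the class
probs = softmax(logits)
preds = np.argmax(probs, axis=-1)
print(probs)           # each row now sums to 1
print(preds)           # prints [2 1]
```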

That helps, thank you for your explanation.

Just wondering what’s the true difference here.

When setting it to true,

does it just use a higher-precision data structure, e.g. like BigDecimal vs double in Java (just an example, I haven’t used Java for years, so it may be an inaccurate example)?

Or does it use something like symbolic computation (e.g. x = a / b * b → x = a), so that some factors can be eliminated and thus it might achieve better precision?

They don’t use higher precision; it turns out that different algorithms have different properties w.r.t. how rounding errors propagate. Of course we are dealing with exponents of e for the softmax or sigmoid and logarithms for the cross entropy loss. One concrete example is that with either sigmoid or softmax, the values can “saturate” and round to exactly 0. or 1., and then you get NaN for the cost because log(0) is -\infty. When you’re doing both computations together, they can catch that case and just use a number very close to 0. or 1., so that the cost is an actual value.
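You can see that saturation with a quick numpy experiment (the fused formula at the end is the standard sigmoid-cross-entropy-from-logits identity, not the literal framework source):

```python
import numpy as np

z = 40.0                            # a large positive logit
p = 1.0 / (1.0 + np.exp(-z))        # sigmoid saturates...
print(p == 1.0)                     # ...and rounds to exactly 1.0 in float64

y = 0.0                             # true label disagrees with the prediction
with np.errstate(divide="ignore"):
    naive = -(y * np.log(p) + (1 - y) * np.log(1 - p))
print(naive)                        # log(1 - 1.0) = log(0), so the loss is inf

# fused from-logits form: max(z, 0) - z*y + log(1 + exp(-|z|)) stays finite
stable = np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))
print(stable)                       # ~40, the mathematically correct loss
```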

This is real math. Google “numerical analysis” and once you find a good site, read the section about “error propagation”. Or if you want a concrete example, complete the experiment described on this thread and you’ll see that the answers do differ in the 7th decimal place.

Or you could look at the actual TF code to see what they do. I’ve never actually had the guts to do that, but it is Open Source, right?

As just a simple example of how different mathematically equivalent ways to express a given computation can give different results in floating point, try some experiments like this:

# experiment with error propagation
import numpy as np

np.random.seed(42)   # fix the seed so the output below is reproducible
A = np.random.rand(3,5)
print(f"A.dtype = {A.dtype}")
print(f"A = {A}")
m = 7.
B = 1./m * A
C = A/m
D = (B == C)
print(f"D = {D}")
diff = (B - C)
diffNorm = np.linalg.norm(diff)
print(f"diffNorm = {diffNorm}")
print(f"diff = {diff}")

Running the above gives this:

A.dtype = float64
A = [[0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]
 [0.15599452 0.05808361 0.86617615 0.60111501 0.70807258]
 [0.02058449 0.96990985 0.83244264 0.21233911 0.18182497]]
D = [[False  True False False  True]
 [ True False False False  True]
 [False  True  True False False]]
diffNorm = 2.9082498092558215e-17
diff = [[-6.93889390e-18  0.00000000e+00 -1.38777878e-17 -1.38777878e-17 ...]
 [ 0.00000000e+00 -1.73472348e-18 -1.38777878e-17 -1.38777878e-17 ...]
 [-4.33680869e-19  0.00000000e+00  0.00000000e+00 -3.46944695e-18 ...]]
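Another tiny illustration of the same phenomenon, with no arrays involved: floating point addition is not even associative, so two mathematically identical expressions can give different answers depending on the order of operations.

```python
# two mathematically equal expressions, two different float results
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # prints 1.0 -- the huge terms cancel first
print(a + (b + c))   # prints 0.0 -- the 1.0 is swallowed by b first
```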

Thanks a lot for the detailed clarification.