Week 2 - Improved implementation with SoftMax

I am confused about why we used the Sigmoid activation when from_logits = True is set; shouldn’t it be the Softmax activation?

1 Like

Hi @Yug_Desai_RA19110030,
Sigmoid is usually used for binary classification (only two classes); you use Softmax when you have more than two classes.
The idea is to use from_logits=True with Sigmoid for binary classification. For multiclass classification, Softmax is used, and Softmax gives a probability distribution over all the classes.
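
As a rough sketch of what that choice looks like in Keras (my own example, not from the assignment, assuming the usual tf.keras loss classes):

import tensorflow as tf

# binary classification: one output unit, sigmoid handled inside the loss
binary_loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)

# multiclass classification: one output unit per class, softmax handled inside the loss
multi_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

# raw (linear) outputs of the last layer, i.e. the logits
binary_logits = tf.constant([[2.3], [-1.7]])
multi_logits = tf.constant([[2.3, -1.7, 0.4]])

print(binary_loss(tf.constant([[1.], [0.]]), binary_logits).numpy())
print(multi_loss(tf.constant([[0., 0., 1.]]), multi_logits).numpy())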

2 Likes

That clarifies the doubt, thank you

2 Likes

Just one further clarification here: the issue of from_logits is independent of whether we are doing binary or multiclass classification. In either type of classification, it makes more sense to use from_logits = True, which just means that whatever the activation function is (sigmoid or softmax), it is computed as a unified part of the cross entropy loss calculation. Here’s a thread which explains why this mode is more advantageous. The “tl;dr” is that from_logits = True gives you answers that are closer to the mathematically correct answers we would get if we could do the calculations using the real numbers \mathbb{R}, instead of just approximating things in floating point.
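
For example, here is a small sketch of my own (not from the course notebook) showing that passing logits with from_logits = True computes the same mathematical quantity as applying softmax yourself and using the default from_logits = False, just via a more numerically robust route:

import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1]])
labels = tf.constant([[0., 0., 1.]])

# from_logits=True: pass the raw linear outputs; softmax happens inside the loss
loss_from_logits = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
print(loss_from_logits(labels, logits).numpy())

# from_logits=False (the default): apply softmax yourself, then compute the loss
probs = tf.nn.softmax(logits)
loss_from_probs = tf.keras.losses.CategoricalCrossentropy()
print(loss_from_probs(labels, probs).numpy())
# the two results agree up to floating point error; the first is the more stable path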

2 Likes

Based on my understanding, can I say, considering the equation:
7*(22/7)

  1. When we don’t use from_logits, i.e.,

from_logits = False,

the calculation for the equation is:
7*3.1428571429 = 22.0000000003

  2. Whereas, when we use

from_logits = True,

the calculation for the equation becomes how we would approach it with real numbers, where we cancel the numerator and denominator to get:
7*(22/7) = 22 ?

The example you show in hand-coded Python does show the inaccuracies that arise because of the finite nature of floating point representations. We literally cannot exactly represent even something as simple to express as \frac{1}{3}, so the answers are all approximations. What is happening with the difference between from_logits = True and from_logits = False is that in the True case, they get to choose computations that are better approximations of the cost, so that we end up with more accurate answers.
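
Here is a quick illustration of that point in plain Python (my own tiny example):

# 1/3 has no exact binary floating point representation; printing more digits
# than the default repr exposes the stored approximation
print(f"{1/3:.20f}")      # prints something like 0.33333333333333331483
print(0.1 + 0.2 == 0.3)   # False, for the same reason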

One other point to make here is that in order to use from_logits = True mode, the output of our network is the linear output of the last layer and does not include sigmoid or softmax. So when we use our network in “predict” mode, as opposed to training mode, we need to take the output and feed it through the appropriate activation to actually get the prediction answer.
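
As a sketch of what that looks like in practice (an illustrative model of my own, not the assignment’s architecture):

import tensorflow as tf

# a model trained with from_logits=True: the last layer is linear (no softmax)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3),          # raw logits come out of this layer
])

x_new = tf.random.uniform((2, 4))      # two made-up input examples with 4 features
logits = model(x_new)

# at prediction time we apply the activation ourselves to get probabilities
probs = tf.nn.softmax(logits)          # use tf.nn.sigmoid instead for binary classification
predicted_class = tf.argmax(probs, axis=-1)
print(predicted_class.numpy())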

That helps, thank you for your explanation.

Just wondering what’s the true difference here.

When setting it to true,

does it just use a higher-precision data structure, e.g. BigDecimal vs. double in Java (just an example, I haven’t used Java for years, so it may be an inaccurate example)?

Or does it use something like symbolic computation (e.g. x = a / b * b → x = a), so that some factors can be eliminated and thus it might achieve better precision?

They don’t use higher precision, but it turns out that different algorithms have different properties w.r.t. how rounding errors propagate. Of course we are dealing with exponentials for the softmax or sigmoid and logarithms for the cross entropy loss. One concrete example is that with either sigmoid or softmax, the values can “saturate” and round to exactly 0. or 1., and then you get NaN (or \infty) for the cost because log(0) is -\infty. When the two computations are done together, that case can be caught and a number very close to 0. or 1. used instead, so that the cost is an actual finite value.
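
Here is a tiny NumPy sketch of my own (not the actual TF implementation, just an illustration of the saturation problem; the stable formula below is one standard way the fused “from logits” computation can be written):

import numpy as np

def sigmoid(z):
    return 1. / (1. + np.exp(-z))

def naive_bce(y, a):
    # two-step version: cross entropy computed on the already-rounded sigmoid output
    return -(y * np.log(a) + (1. - y) * np.log(1. - a))

def fused_bce(y, z):
    # one standard numerically stable "from logits" form:
    # max(z, 0) - z*y + log(1 + exp(-|z|))
    return np.maximum(z, 0.) - z * y + np.log1p(np.exp(-np.abs(z)))

z = 40.                            # a confident logit: sigmoid(40) rounds to exactly 1.0 in float64
y = 0.                             # but the true label disagrees
print(sigmoid(z))                  # 1.0 -- the activation has saturated
print(naive_bce(y, sigmoid(z)))    # inf, because log(1 - 1.0) = log(0)
print(fused_bce(y, z))             # ~40.0, the correct finite cross entropy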

This is real math. Google “numerical analysis” and once you find a good site, read the section about “error propagation”. Or if you want a concrete example, complete the experiment described on this thread and you’ll see that the answers do differ in the 7th decimal place.

Or you could look at the actual TF code to see what they do. I’ve never actually had the guts to do that, but it is Open Source, right?

As just a simple example of how different mathematically equivalent ways to express a given computation can give different results in floating point, try some experiments like this:

# experiment with error propagation
import numpy as np

np.random.seed(42)
A = np.random.rand(3,5)
print(f"A.dtype = {A.dtype}")
print(f"A = {A}")
m = 7.
B = 1./m * A
C = A/m
D = (B == C)
print(f"D = {D}")
diff = (B - C)
diffNorm = np.linalg.norm(diff)
print(f"diffNorm = {diffNorm}")
print(f"diff = {diff}")

Running the above gives this:

A.dtype = float64
A = [[0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]
 [0.15599452 0.05808361 0.86617615 0.60111501 0.70807258]
 [0.02058449 0.96990985 0.83244264 0.21233911 0.18182497]]
D = [[False  True False False  True]
 [ True False False False  True]
 [False  True  True False False]]
diffNorm = 2.9082498092558215e-17
diff = [[-6.93889390e-18  0.00000000e+00 -1.38777878e-17 -1.38777878e-17
   0.00000000e+00]
 [ 0.00000000e+00 -1.73472348e-18 -1.38777878e-17 -1.38777878e-17
   0.00000000e+00]
 [-4.33680869e-19  0.00000000e+00  0.00000000e+00 -3.46944695e-18
  -3.46944695e-18]]

Thanks a lot for the detailed clarification.