Numerically correct implementation of softmax

Hi Mentor,

For the case of from_logits=True with a linear activation in the output layer, our doubt is: how does the sparse categorical cross-entropy loss function compute the softmax inside the loss function?

Also, for the similar case of from_logits=True with a linear activation in the output layer, how does the binary cross-entropy loss function compute the sigmoid inside the loss function?

It's going round in my head… can someone please help clarify?

Hi, @Anbu !

The from_logits flag only refers to the difference between computing the loss function from probabilities and computing it from logits. Probabilities are normalized - i.e. they lie in the range [0, 1]. Logits aren't normalized and can take any value in (-inf, +inf).

As you have a linear activation function in the last layer, you should include from_logits=True.
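To make that concrete, here is a minimal sketch (the layer sizes are made up for illustration, not taken from the course): the last Dense layer has a linear activation, so it outputs raw logits, and the loss object is created with from_logits=True.

```python
import tensorflow as tf

# Minimal sketch: the last Dense layer uses a linear activation, so it outputs
# raw logits; the loss is therefore constructed with from_logits=True.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                     # made-up input size
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='linear'),  # logits in (-inf, +inf)
])

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```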

Hello @Anbu,

You may also want to read this post!

Raymond


Thanks for the reply, sir, but my doubt is: is the SparseCategoricalCrossentropy loss function designed to be equivalent to a softmax loss function? So does sparse categorical mean softmax?

Softmax is a function that converts output logits into probabilities. Softmax is not a loss function.

SparseCategoricalCrossentropy is the loss function that computes how far the predictions are from the truth.

When we have an N-class classification problem, we have a Dense layer as the output layer, which will produce N logits per sample. Now we have a choice to make:

  • if we use Softmax to convert logits into probabilities, we use SparseCategoricalCrossentropy(from_logits=False) to compute the loss. We set from_logits to False because they are probabilities instead of logits.
  • if we do not use Softmax, then they remain logits, and we need to use SparseCategoricalCrossentropy(from_logits=True) to calculate the loss. Both choices are sketched below.
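Here is a minimal sketch of the two choices, assuming a hypothetical 5-class problem (the input and layer sizes are made up for illustration):

```python
import tensorflow as tf

NUM_CLASSES = 5  # hypothetical number of classes

# Choice 1: Softmax in the model, so the loss receives probabilities
model_probs = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax'),
])
loss_probs = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

# Choice 2: no Softmax in the model, so the loss receives raw logits and
# applies the softmax internally
model_logits = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(NUM_CLASSES),  # linear activation by default
])
loss_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
```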

Cheers,
Raymond

Just to be 100% clear, the softmax is still happening ("being used") in that case, but the point is that it is done internally within the loss function. The reason for doing it that way is that the numerical behavior of the algorithm is better (more stable) when the softmax and the log loss calculations are integrated in the same logic, as opposed to doing them as completely separate steps. It's also less code for us to write, and that's the way Prof Ng always does it once we graduate to using TF.
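To see the stability difference, here is a small sketch with deliberately extreme, made-up logits (the exact numbers you get depend on the backend's clipping and float precision, so treat this as an illustration rather than a spec):

```python
import tensorflow as tf

y_true = tf.constant([2])                    # true class index
logits = tf.constant([[50.0, 0.0, -50.0]])   # deliberately extreme logits

# Separate steps: softmax first, then cross-entropy on the probabilities.
# The probability of the true class underflows toward 0 in float32, so the
# result is limited by clipping/precision rather than by the actual logits.
probs = tf.nn.softmax(logits)
loss_separate = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False)(y_true, probs)

# Fused: the softmax and the log are combined inside the loss (log-sum-exp
# style), which stays accurate even for extreme logits.
loss_fused = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True)(y_true, logits)

print(loss_separate.numpy(), loss_fused.numpy())
```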


Also note that there are three different cross entropy loss functions provided by TF:

BinaryCrossentropy - this is for the case of binary (yes/no) classification. In that case the output activation is sigmoid, but it still supports from_logits=True or False.

CategoricalCrossentropy - for the multiclass case, but when the labels are provided in “one hot” form. The activation is softmax in the multiclass case.

SparseCategoricalCrossentropy - for the multiclass case, but with the labels specified in categorical form, not one hot form, meaning they are the class numbers of the output classes. The activation is softmax here also.

In all three cases, you still have to choose which from_logits mode you want to use. Prof Ng recommends always using from_logits=True.
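For illustration only (made-up labels and logits, just to show the expected label formats), the three losses can be called directly on tensors:

```python
import tensorflow as tf

# Binary case: one logit per sample, labels are 0 or 1
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
print(bce(tf.constant([[1.0], [0.0]]), tf.constant([[2.3], [-1.1]])).numpy())

# Multiclass case with one-hot labels
cce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
print(cce(tf.constant([[0.0, 0.0, 1.0]]),
          tf.constant([[0.5, -0.2, 2.0]])).numpy())

# Multiclass case with integer (sparse) class labels
scce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
print(scce(tf.constant([2]), tf.constant([[0.5, -0.2, 2.0]])).numpy())
```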

Please read the documentation for the above three functions if you want more information.