Numerically correct implementation of softmax

Hi Mentor,

For the case of from_logits=True with a linear activation in the output layer, our doubt is: how does the sparse categorical cross-entropy loss function compute the softmax inside the loss function?

Also, for the similar case of from_logits=True with a linear activation in the output layer, how does the binary cross-entropy loss function compute the sigmoid inside the loss function?

It's going round in my head… can someone please help clarify?

Hi, @Anbu !

The from_logits flag only refers to the difference between computing the loss function from probabilities and computing it from logits. Probabilities are normalized - i.e. they lie in the range [0, 1]. Logits aren't normalized and can take any value in (-inf, +inf).

As you have a linear activation function in the last layer, you should include from_logits=True.
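To make that concrete, here is a minimal sketch (the layer sizes are made up for illustration, not taken from the course): the last Dense layer has a linear activation, so it outputs raw logits, and the loss object is created with from_logits=True.

```python
import tensorflow as tf

# Minimal sketch: the last Dense layer uses a linear activation, so it outputs
# raw logits; the loss is therefore constructed with from_logits=True.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                     # made-up input size
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='linear'),  # logits in (-inf, +inf)
])

model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```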

Hello @Anbu,

You may also want to read this post!

Raymond


Thanks for the reply, sir, but my doubt is: is the SparseCategoricalCrossentropy loss function designed to be equivalent to a softmax loss function? So does sparse categorical mean softmax?

Softmax is a function that converts output logits into probabilities. Softmax is not a loss function.

SparseCategoricalCrossentropy is the loss function that computes how far the predictions are from the truth.

When we have an N-class classification problem, we have a Dense layer as the output layer, which will produce N logits per sample. Now we have a choice to make:

  • if we use Softmax to convert logits into probabilities, we use SparseCategoricalCrossentropy(from_logits=False) to compute the loss. We set from_logits to False because they are probabilities instead of logits.
  • if we do not use Softmax, then they remain logits, and we need to use SparseCategoricalCrossentropy(from_logits=True) to calculate the loss. Both choices are sketched below.
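Here is a minimal sketch of the two choices, assuming a hypothetical 5-class problem (the input and layer sizes are made up for illustration):

```python
import tensorflow as tf

NUM_CLASSES = 5  # hypothetical number of classes

# Choice 1: Softmax in the model, so the loss receives probabilities
model_probs = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax'),
])
loss_probs = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

# Choice 2: no Softmax in the model, so the loss receives raw logits and
# applies the softmax internally
model_logits = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(NUM_CLASSES),  # linear activation by default
])
loss_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
```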

Cheers,
Raymond

Just to be 100% clear, the softmax is still happening ("being used") in that case, but the point is that it is done internally within the loss function. The reason for doing it that way is that the numerical behavior of the algorithm is better (more stable) when the softmax and the log loss calculations are integrated in the same logic, as opposed to doing them as completely separate steps. It's also less code for us to write, and that's the way Prof Ng always does it once we graduate to using TF.
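To see the stability difference, here is a small sketch with deliberately extreme, made-up logits (the exact numbers you get depend on the backend's clipping and float precision, so treat this as an illustration rather than a spec):

```python
import tensorflow as tf

y_true = tf.constant([2])                    # true class index
logits = tf.constant([[50.0, 0.0, -50.0]])   # deliberately extreme logits

# Separate steps: softmax first, then cross-entropy on the probabilities.
# The probability of the true class underflows toward 0 in float32, so the
# result is limited by clipping/precision rather than by the actual logits.
probs = tf.nn.softmax(logits)
loss_separate = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False)(y_true, probs)

# Fused: the softmax and the log are combined inside the loss (log-sum-exp
# style), which stays accurate even for extreme logits.
loss_fused = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True)(y_true, logits)

print(loss_separate.numpy(), loss_fused.numpy())
```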


Also note that there are three different cross entropy loss functions provided by TF:

BinaryCrossentropy - this is for the case of binary (yes/no) classification. In that case the output activation is sigmoid, but it still supports from_logits=True or False.

CategoricalCrossentropy - for the multiclass case, but when the labels are provided in “one hot” form. The activation is softmax in the multiclass case.

SparseCategoricalCrossentropy - for the multiclass case, but with the labels specified in categorical form, not one hot form, meaning they are the class numbers of the output classes. The activation is softmax here also.

In all three cases, you still have to choose which from_logits mode you want to use. Prof Ng recommends always using from_logits=True.
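For illustration only (made-up labels and logits, just to show the expected label formats), the three losses can be called directly on tensors:

```python
import tensorflow as tf

# Binary case: one logit per sample, labels are 0 or 1
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
print(bce(tf.constant([[1.0], [0.0]]), tf.constant([[2.3], [-1.1]])).numpy())

# Multiclass case with one-hot labels
cce = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
print(cce(tf.constant([[0.0, 0.0, 1.0]]),
          tf.constant([[0.5, -0.2, 2.0]])).numpy())

# Multiclass case with integer (sparse) class labels
scce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
print(scce(tf.constant([2]), tf.constant([[0.5, -0.2, 2.0]])).numpy())
```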

Please read the documentation for the above three functions if you want more information.