Softmax implementation


Why did he scrap softmax and change it to linear?

It would be better if you gave us a link to this video (or at least the video name with the proper week number), so we can watch it and tell you why he changed it to linear.

TensorFlow works slightly better if you use a linear output and “from_logits = True”, rather than using softmax directly in the output layer.

Internally, the “from_logits = True” parameter tells the loss function configured at compile time to apply the softmax automatically.
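
For example, here is a minimal sketch of the two setups in Keras (the layer sizes and the 10-class output are made up for illustration; only the last layer's activation and the loss argument matter):

```python
import tensorflow as tf

# Option A: softmax inside the model, loss expects probabilities.
model_a = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),   # outputs probabilities a
])
model_a.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy())

# Option B (recommended): linear output, softmax folded into the loss.
model_b = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation='relu'),
    tf.keras.layers.Dense(10, activation='linear'),     # outputs raw logits z
])
model_b.compile(optimizer='adam',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```

One consequence of option B: the model outputs logits rather than probabilities, so if you want probabilities at prediction time you apply tf.nn.softmax to the model's output yourself.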

I didn’t get you. I mean, if we change the activation from softmax to linear, then how will the softmax function be implemented? Can you please elaborate a little more?

Course: Advanced Algorithms, week 2.
Name of video: “Improved implementation of softmax”

The softmax activation will be applied by TensorFlow inside the loss function once you compile with from_logits=True.

Hi @gigaGPT,

There are two ways to compute the softmax activation, which are:

  1. When using softmax activation in the last layer:

    • The logits (z), which are the raw, unnormalized outputs, are computed in this layer.
    • Softmax activation is then applied to these logits to obtain the activations (a) for each class.
    • The loss can be computed directly using these activations (a) with an appropriate loss function, such as sparse categorical cross-entropy in this case.
  2. When using linear activation in the last layer:

    • Similarly, the logits (z), again the raw, unnormalized outputs, are computed in the last layer.
    • Instead of applying softmax activation to obtain activations (a), the logits (z) are used directly.
    • Setting from_logits=True in the compile step instructs TensorFlow to internally apply softmax to the logits (z), as Tom described (see the quick check right after this list).
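
Here is a quick numerical check that the two approaches compute the same loss, using made-up logits and labels (not from the course lab):

```python
import tensorflow as tf

# Hypothetical logits (z) for 2 examples over 4 classes, with their true labels.
z = tf.constant([[2.0, 1.0, 0.1, -1.0],
                 [0.5, 2.5, 0.3,  0.0]])
y = tf.constant([0, 1])

# Approach 1: compute the activations a = softmax(z) first, then the loss.
a = tf.nn.softmax(z)
loss_1 = tf.keras.losses.SparseCategoricalCrossentropy()(y, a)

# Approach 2: hand the raw logits to the loss and let it apply softmax internally.
loss_2 = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(y, z)

print(loss_1.numpy(), loss_2.numpy())   # the two values agree (up to float rounding)
```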

So, the only difference between these two approaches is that in the first, the softmax activation is computed separately before the sparse categorical cross-entropy loss is calculated, while in the second, the softmax is computed inside the sparse categorical cross-entropy loss, as shown in the slide. The second approach is preferred because it avoids the numerical instability and rounding errors that can occur when the softmax activation is computed separately.
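
To see that numerical point concretely, here is a small made-up example with extreme logits, where the separately computed softmax rounds the true class's probability all the way to zero:

```python
import tensorflow as tf

z = tf.constant([[1000.0, 0.0, -1000.0]])   # extreme logits
y = tf.constant([1])                        # the true class has a tiny probability

# Softmax first: the true-class probability rounds to 0 in float32, so the loss
# computed from these probabilities can no longer recover the real value.
a = tf.nn.softmax(z)
loss_from_a = tf.keras.losses.SparseCategoricalCrossentropy()(y, a)

# Logits straight into the loss: softmax and log are combined internally,
# so the result is close to the exact cross-entropy (about 1000 here).
loss_from_z = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(y, z)

print(loss_from_a.numpy(), loss_from_z.numpy())
```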