Why Softmax function?

The trick is the `from_logits=True` argument to the loss function. It tells the loss to apply the activation itself (softmax for the categorical cross-entropy losses; sigmoid in the case of BCE), which is numerically more stable. That is explained on this thread. It does mean that you need to apply softmax manually when you make predictions with the final model after training is complete. Note, though, that softmax is monotonic: the largest input always produces the largest output, so you can tell which class is the prediction even without applying softmax.
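A minimal sketch of that monotonicity point, using NumPy rather than any particular framework: the argmax of the raw logits is always the same as the argmax of their softmax.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability before exponentiating
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical raw logits from a final Dense layer with no activation
logits = np.array([[2.0, -1.0, 0.5],
                   [0.1, 3.2, -0.7]])

probs = softmax(logits)

# Softmax is monotonic, so the predicted class is identical either way
print(np.argmax(logits, axis=-1))  # → [0 1]
print(np.argmax(probs, axis=-1))   # → [0 1]
```

You only need the softmax itself when you want calibrated probabilities, not just the predicted class.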

So you actually are using softmax; it just is not included directly in the output layer of your defined network architecture.
