Why don’t we add a sigmoid activation function at our prediction layer?

Also, why do we have only one neuron and not two since we have two classes?

I am referencing our notebook for my own project. I have 3 classes in my dataset, and this is my code:

I am trying to use ResNet50

Sorry if this question sounds silly and if the answer is obvious

Please look at the comment about the dense layer with 1 unit. It says right there that for a binary classification problem, 1 unit is sufficient.

The loss function for the model has a `from_logits` flag set to `True`, so the loss is computed after converting the output of the dense layer to the probability scale. See here for the documentation.

For a multi-class prediction (more than 2 classes), the output layer has to have a number of units equal to the number of classes.
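To make the unit counts concrete, here is a small illustrative sketch in plain Python (the values are made up, and `sigmoid`/`softmax` are written out by hand rather than taken from Keras). It shows why 1 unit is enough for 2 classes: a sigmoid over a single logit gives the same probability as the first entry of a 2-unit softmax, while 3 classes need 3 units.

```python
import math

def sigmoid(z):
    # maps one logit to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # maps one logit per class to a probability distribution
    m = max(zs)  # subtract the max for numerical safety
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Binary case: a single logit z behaves like a 2-unit softmax over [z, 0],
# which is why one output unit is sufficient for two classes.
z = 1.7
p_sigmoid = sigmoid(z)
p_softmax = softmax([z, 0.0])[0]

# Multi-class case (e.g. the 3 classes in your dataset):
# one logit per class, softmax over all of them.
logits_3 = [2.0, -1.0, 0.5]
probs_3 = softmax(logits_3)  # 3 probabilities summing to 1
```

So the single unit is not "missing" a class; it encodes the difference between the two classes' scores in one number.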


Note that the point Balaji makes about *from_logits = True* mode for the loss function also applies in the cases in which we have a multiclass output. The way you have written the code with the explicit *softmax* activation is the other way to do it. Your method is also correct, but the method of bundling the activation with the loss calculation is preferred because it is more numerically stable.
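To see why the bundled version is more numerically stable, here is a toy illustration in plain Python with deliberately exaggerated logits (the numbers are made up; this only sketches the kind of fused log-sum-exp computation that `from_logits=True` lets the loss do internally, not the actual Keras implementation):

```python
import math

logits = [1000.0, 0.0]  # exaggerated logits to expose the problem
true_class = 0

# Naive route: explicit softmax, then cross entropy.
# math.exp(1000.0) overflows, so this route fails outright.
try:
    exps = [math.exp(z) for z in logits]
    probs = [e / sum(exps) for e in exps]
    naive_loss = -math.log(probs[true_class])
except OverflowError:
    naive_loss = None

# Fused route: compute log-softmax directly with the log-sum-exp trick,
# never materializing the huge exponentials.
m = max(logits)
log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
stable_loss = -(logits[true_class] - log_sum_exp)  # finite, ~0 here
```

In float32 inside a real network the failure mode is usually not a crash but `inf`/`nan` losses, which is why the fused form is preferred.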

So instead of softmax and categorical cross entropy as a loss function, I could have used categorical cross entropy only, without softmax?

If you don’t use softmax in the output layer, specify `from_logits=True` in the loss function. See this link.

Ohhhhh okay. If softmax is used, then y_pred is turned into a probability distribution; if softmax is not used, then y_pred is a logits tensor, right? This is interesting, I am going to test it right now! Thank you! @balaji.ambresh

That is correct. Here is more on logit and its relationship to probability scale.