In the last layer, may I use “relu” instead of “linear”? It is not specified in the class why it has to be “linear”. Is it because of convergence issues?
Thanks!
Right! Here’s a thread that discusses the reasons for using from_logits = True mode and more about what that means. And here’s one from Raymond that gives a much more complete explanation of the math behind this.
I think, in general (checking with ChatGPT), the reason the final Dense layer needs to be linear rather than restricted to positive values (as with ReLU) is that the output would otherwise be biased and we would lose important information about it. But I guess this is only true if the output can take negative values?
ReLU is not used as the output activation because of its output characteristics: it zeroes out all negative values.
Both SCC (SparseCategoricalCrossentropy) and BC (BinaryCrossentropy) are used for classification. You can either use the corresponding activation in the output layer (softmax or sigmoid), or a linear activation with “from_logits = True”.
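To make the two options concrete, here is a minimal Keras sketch of both setups for a multiclass case. The layer sizes and input shape are just placeholders for illustration, not taken from the course assignment:

```python
import tensorflow as tf
from tensorflow.keras import layers, losses, Sequential

# Option A: softmax in the output layer, the loss receives probabilities.
model_a = Sequential([
    layers.Input(shape=(400,)),
    layers.Dense(25, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model_a.compile(optimizer="adam",
                loss=losses.SparseCategoricalCrossentropy(from_logits=False))

# Option B (preferred): linear output, softmax is applied inside the loss.
model_b = Sequential([
    layers.Input(shape=(400,)),
    layers.Dense(25, activation="relu"),
    layers.Dense(10, activation="linear"),
])
model_b.compile(optimizer="adam",
                loss=losses.SparseCategoricalCrossentropy(from_logits=True))
```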
The point is that you are doing a multiclass classification in this case, so the output layer activation is softmax, not ReLU. There is no purpose in adding ReLU to the mix. It’s possible you could get that to work by training, but it serves no purpose: you would effectively have two activations at the output layer (ReLU → softmax). What is the point of that?
Did you read the threads I linked? The point is that from_logits = True just says that the softmax (or sigmoid in the binary case) happens as part of the cost calculation.
But then your actual trained model has logits as the output, so you need to include the softmax manually in “predict” mode. Either that or just do argmax on the logit outputs, since softmax is monotonic.
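For example, continuing with the hypothetical model_b from the sketch above (x_new is just placeholder input data), prediction could look something like this:

```python
import numpy as np
import tensorflow as tf

logits = model_b.predict(x_new)           # raw logits, not probabilities
probs = tf.nn.softmax(logits).numpy()     # apply softmax manually if you need probabilities
pred_classes = np.argmax(logits, axis=1)  # or skip softmax: argmax on logits gives the same classes
```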
@paulinpaloalto yes, I read it. By using linear you are basically saying you are not using an activation function at the end at all, and instead the softmax is combined with the cost calculation for numerical stability? Thanks!
Yes, that’s correct. Although I would phrase it differently: softmax is the activation, it’s just that you don’t explicitly include it in the output layer. It gets applied in the loss calculation and then as an extra step when executing the model in “predict” mode.
The reason from_logits = True is used is that it gives better numerical stability when computing the softmax. You could put softmax in the last layer and set from_logits = False, but the more robust and stable way is to set from_logits = True with a linear last layer, so that the softmax is computed internally as part of the loss.
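Here is a rough illustration of the stability point, using deliberately extreme logit values (the exact printed numbers may vary by TensorFlow version):

```python
import tensorflow as tf

logits = tf.constant([[1000.0, 0.0, -1000.0]])  # extreme values to force underflow
labels = tf.constant([1])                       # true class is index 1

# Computing softmax explicitly underflows: probs becomes approximately [[1., 0., 0.]],
# so the information about how far off class 1 was is already lost.
probs = tf.nn.softmax(logits)

loss_from_probs = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)(labels, probs)
loss_from_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(labels, logits)

print(loss_from_probs.numpy())   # inaccurate: computed from the underflowed probabilities
print(loss_from_logits.numpy())  # about 1000.0, the correct value, computed stably from logits
```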