I have one more question:
Q2: If we don't use softmax in the last layer (for more stable computation), then how does our model compare its output against the labels each epoch and train the weights?
Tarun, refer to the link below to understand why from_logits=True is significant in the loss with respect to the softmax activation when it is not used in the last dense layer of the model architecture.
So what you guys are saying is: if the final layer has a linear activation and we set
from_logits = True,
then the loss function will expect raw logits in (-infinity, +infinity) and internally apply softmax to them when computing the loss against y_train, whose integer labels range over [0, N-1].
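Here is a minimal sketch of that equivalence, assuming TensorFlow/Keras (the logit values and the 3-class example are made up for illustration):

```python
import tensorflow as tf

# Raw logits from a linear final layer: any real values in (-inf, +inf).
logits = tf.constant([[2.0, 1.0, 0.1]])
# Integer label for SparseCategoricalCrossentropy: a value in [0, N-1].
y_true = tf.constant([0])

# Option 1: the loss receives raw logits and applies softmax internally.
loss_from_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
print(loss_from_logits(y_true, logits).numpy())

# Option 2: softmax applied explicitly, the loss receives probabilities.
probs = tf.nn.softmax(logits)
loss_from_probs = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
print(loss_from_probs(y_true, probs).numpy())

# Both print the same cross-entropy value (~0.417); Option 1 just lets
# the loss do the softmax in a numerically more stable way.
```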
(Deepti, thanks for the article!)
No, the understanding should be: because we are not using a softmax activation in the last dense layer, the loss must be given from_logits=True so that it receives the raw logits and applies the softmax internally. The loss chosen here, SparseCategoricalCrossentropy, is designed to pair with softmax; it is not the right choice when the last dense layer's output is treated as plain linear scores (i.e. without from_logits=True) or passed through a sigmoid activation.
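For completeness, a hedged sketch of what such a model definition and compile step could look like (the layer sizes, input shape, and 3-class setup are illustrative assumptions, not from this thread):

```python
import tensorflow as tf

# Hypothetical 3-class classifier; the last Dense layer has no
# activation argument, so it defaults to linear and outputs raw logits.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation="relu", input_shape=(400,)),
    tf.keras.layers.Dense(15, activation="relu"),
    tf.keras.layers.Dense(3),  # linear output: raw logits
])

# from_logits=True tells the loss to apply softmax internally, so the
# softmax + SparseCategoricalCrossentropy pairing is preserved even
# though the layer itself is linear.
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# At predict time the outputs are logits, so apply softmax yourself
# if you need probabilities:
#   probs = tf.nn.softmax(model(X))
```

This answers Q2 as well: training is unaffected because the loss still computes the same cross-entropy against y_train each epoch; the softmax has simply moved from the layer into the loss.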