Would it be correct that the following implementation of logistic regression is better than just specifying “sigmoid” as the activation function for the output layer as Dr. Ng said?

If so, are there any situations where using the Sigmoid activation function would be advantageous?

It depends on what you mean by “better”.

Andrew recommends using a linear output with `from_logits = True` when you have multiple labels, because it does a couple of things (a minimal sketch follows this list):

- It automatically applies softmax inside the loss computation.
- It gains some numerical stability, because the softmax and the cross entropy are computed together directly from the logits, rather than from intermediate probability values that have already lost precision.
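Here is what that looks like in TensorFlow. The layer sizes and the `adam` optimizer here are just illustrative assumptions, not taken from the original post:

```python
import tensorflow as tf

# Multiclass model with a *linear* output layer: it emits raw logits
model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation='relu'),
    tf.keras.layers.Dense(15, activation='relu'),
    tf.keras.layers.Dense(10),  # no activation: linear output
])

# from_logits=True tells the loss to apply softmax internally,
# computing the cross entropy directly from the logits
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# At prediction time the outputs are logits, so apply softmax explicitly:
# probs = tf.nn.softmax(model(X_batch))
```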

Yes. For one, when you have only two labels, i.e. a true/false result.

One other point to make here: just to be accurate, the network you have implemented is *not* Logistic Regression. It is a Fully Connected network with 3 layers which does binary classification. Logistic Regression is essentially a trivial Neural Network with only the “output” layer and does binary classification.
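To make that distinction concrete, here is a sketch (the input dimension and hidden layer sizes are made up for illustration). Logistic Regression is a single sigmoid unit, while a Fully Connected network puts hidden layers in front of it:

```python
import tensorflow as tf

n_features = 4  # hypothetical input dimension, just for illustration

# Logistic Regression: only the "output" layer, a single sigmoid unit
logreg = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# A 3-layer Fully Connected binary classifier adds hidden layers in front
fc_net = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
```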

I would also state the case differently: you’re using “sigmoid” at the output layer either way. It’s just a question of whether you explicitly include the “sigmoid” activation or whether you let it be handled internally within the cross entropy loss function (the `from_logits = True` mode). Of course if you don’t explicitly add the “sigmoid” in the output layer, then you also have to add it explicitly in your “predict” logic.
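For the binary case, a minimal sketch of that second option (the layer sizes, optimizer, and the `predict` helper are my own illustrative assumptions):

```python
import tensorflow as tf

# Binary classifier with a linear output layer: it emits one raw logit
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1),  # no activation: sigmoid is handled in the loss
])
model.compile(
    optimizer='adam',
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
)

# Because the model outputs logits, the sigmoid has to appear in the
# prediction logic instead (hypothetical helper):
def predict(model, X, threshold=0.5):
    probs = tf.math.sigmoid(model(X))
    return tf.cast(probs >= threshold, tf.int32)
```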

So maybe you could argue that the one case in which explicitly adding “sigmoid” in the output layer is better is that it makes your predict logic simpler, if that goal is more important to you than the improved numerical accuracy gained by the other method. You could also try it both ways in a given case to see if the predictions are actually affected by any accuracy differences. It’s possible that in any given case it ends up not mattering that much to the results of training.

Here’s a thread which discusses why the `from_logits = True` method is preferred. Here’s a thread from Raymond that goes into some depth in showing why the latter method is more accurate.
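As a rough illustration of the accuracy point (the specific logit value is a made-up example): with an extreme logit, the explicit sigmoid saturates to exactly 1.0 in float32, so the loss computed from that probability is degraded, while the `from_logits = True` path recovers the correct value.

```python
import tensorflow as tf

z = tf.constant([[20.0]])  # made-up extreme logit
y = tf.constant([[0.0]])   # true label 0: correct loss is -log(1 - sigmoid(20)), about 20

# Path 1: explicit sigmoid, then cross entropy on the probability.
# sigmoid(20) rounds to exactly 1.0 in float32, so information is lost.
p = tf.math.sigmoid(z)
loss_from_probs = tf.keras.losses.BinaryCrossentropy(from_logits=False)
print(float(loss_from_probs(y, p)))  # degraded value: the log(0) is clipped internally

# Path 2: pass the raw logit; the loss uses a stable formulation internally.
loss_from_logits = tf.keras.losses.BinaryCrossentropy(from_logits=True)
print(float(loss_from_logits(y, z)))  # ~20.0, the correct value
```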

Thanks Paul. Is it correct to say that if performing binary classification, *sigmoid* needs to be specified either in the output layer or in the predict line of code? And that the latter, with a linear output layer, is more numerically stable?

In the general simple case, yes to both.

More stable than what?

The stability argument only applies when you’re using TensorFlow with an NN, in the situation where you might otherwise think about using softmax() in the output layer.