I noticed that for the Transfer Learning assignment we are not using a Sigmoid activation for the last layer. Is this because the network has many non-linear layers, so the ability to learn a complex function is not compromised?
I was expecting a Sigmoid activation in the last layer since it's a binary classification problem. I realize that we are using the binary cross entropy loss function, so it will still encourage the output neuron to produce a result in [0, 1].
Can you explain briefly under what condition using a linear activation at the last layer is worse than a Sigmoid? Btw, I am aware of Sigmoid’s saturation problem.
It’s a little more subtle than that: we actually are using a sigmoid activation at the output layer, but it is applied as part of the loss function. Note that we pass the from_logits = True argument to the cross entropy loss function, which tells it that the inputs are raw logits and that it should apply the sigmoid internally.
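To see why folding the sigmoid into the loss helps, here is a minimal sketch in plain Python (not the assignment's TensorFlow code) comparing the naive "sigmoid, then log" computation with the numerically stable logits form that from_logits = True uses internally. The function names are my own for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_naive(y, z):
    # Apply sigmoid first, then cross entropy.
    # log(p) underflows/overflows for large |z|.
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_from_logits(y, z):
    # Algebraically the same loss, rearranged so no
    # intermediate value ever overflows or underflows.
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# For moderate logits the two agree:
print(bce_naive(1.0, 2.0), bce_from_logits(1.0, 2.0))

# For an extreme logit the naive version breaks (exp(800) overflows),
# while the stable version returns the correct finite loss:
print(bce_from_logits(1.0, -800.0))
```

The stable form is why the course builds models with a linear output layer and lets the loss handle the sigmoid, rather than the sigmoid being optional in principle.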
Here’s a thread which explains why we do it that way. This pattern is used consistently throughout the courses; we first encountered it back in DLS C2 W3 in the TensorFlow Introduction assignment, which is what that thread is discussing.
Then the additional point is that when you want to use the trained model to make a prediction, you either manually add the sigmoid, or change the interpretation of the raw output so that a logit > 0 means True.
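Both prediction options give identical labels, because sigmoid(z) > 0.5 exactly when z > 0. A small sketch (plain Python, hypothetical function names):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Option 1: manually apply the sigmoid, then threshold at 0.5
def predict_with_sigmoid(logit):
    return sigmoid(logit) > 0.5

# Option 2: reinterpret the raw logit, thresholding at 0
def predict_from_logit(logit):
    return logit > 0

# The two decision rules agree for every logit:
for z in [-3.2, -0.1, 0.0, 0.7, 5.0]:
    print(z, predict_with_sigmoid(z), predict_from_logit(z))
```

Option 2 is cheaper and avoids any floating-point concerns from the sigmoid; Option 1 is useful when you also want the probability itself, not just the label.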
Ah yes, I remember now. Thanks for the quick reply, I greatly appreciate it!