In the transfer learning exercise (week 2), when adding the Dense layers, why don't we use a sigmoid activation even though we are dealing with a binary classification problem?
That is because the TF/Keras loss functions all support a mode in which we feed the linear activation output (the "logits") to the loss function and let it compute the activation and the loss together. This is enabled with the from_logits = True parameter. The reason for using the loss functions this way is that it is a) more efficient (one fewer call) and b) more numerically stable (it is easier to deal with saturated sigmoid values, for example). You'll see that we always use this mode in both the binary and the categorical cases.
Please have a look at the documentation for the loss function if what I said above is not enough to fully explain this.
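Here is a minimal sketch of what that looks like (the input shape and layer sizes below are made up for illustration, not taken from the course notebook):

```python
import tensorflow as tf

# A minimal sketch, not the exact course model: the input shape (128,)
# and the layer size 64 are illustrative placeholders.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(128,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),  # no activation: outputs a raw logit
])

# from_logits=True tells the loss to apply the sigmoid internally,
# fusing the activation and the cross-entropy into one numerically
# stable step.
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# Note that the model now outputs logits, so at prediction time apply
# the sigmoid yourself if you need probabilities:
#   probs = tf.sigmoid(model(x))
```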