The key is in how you invoke the loss function. You have two choices: you can explicitly include `sigmoid` or `softmax` as the output layer activation (depending on whether it's binary or multiclass classification), or you can omit the output activation and pass the `from_logits=True` argument to tell the loss function to compute the activation together with the loss internally. The two methods are logically equivalent, but the latter is preferable: it's less code to write, and computing the activation and the loss in one step is more numerically stable, so it gives more accurate results. Here's a thread which discusses that and explains more about it.
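For concreteness, here's a minimal sketch of the two styles in Keras for the multiclass case. The layer sizes, input shape, and choice of `SparseCategoricalCrossentropy` are just placeholder assumptions for illustration, not anything from the assignment:

```python
import tensorflow as tf

# Option 1: explicit softmax in the output layer.
# The loss then expects probabilities (from_logits defaults to False).
model_a = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),  # hypothetical sizes
    tf.keras.layers.Dense(10, activation="softmax"),  # outputs probabilities
])
model_a.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),  # from_logits=False
)

# Option 2: no output activation; the loss applies softmax internally
# in a numerically stable way.
model_b = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(10),  # outputs raw logits
])
model_b.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```

One thing to keep in mind with the second style: the model's predictions are raw logits, so if you need probabilities at inference time you apply `tf.nn.softmax` (or `tf.math.sigmoid` in the binary case) to the outputs yourself.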
Mind you, I am not a mentor for this particular course, so I don't know if the assignment has any requirements about which way you implement it here. You'll need to consult the instructions.