Why do we just compute A1 and A2 with the ReLU function, then return Z3 and compute its cost? Why not compute A3 with a softmax activation, since we are predicting multi-class labels? Thanks in advance.
That’s because they want you to use the BinaryCrossentropy loss function with the from_logits=True argument, which causes the sigmoid calculation to be incorporated into the loss computation. That is preferred because TF can manage numerical stability better when it does the two together, e.g. when dealing with “saturated” sigmoid output values. Here’s the doc page for BinaryCrossentropy.
They don’t really explain what the from_logits=True argument does in the assignment, but they do literally write out the code for you in the instructions. Maybe it would help if they explained it a bit more. I’ll suggest that.
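In the meantime, here’s a quick toy sketch of my own (not the notebook’s code) that shows what from_logits=True buys you, comparing raw logits handed straight to the loss against applying the sigmoid yourself first:

```python
import tensorflow as tf

# Toy example: raw logits from the last linear layer, no activation applied.
# The 30.0 with a true label of 0 is a badly "saturated" case.
logits = tf.constant([[2.5], [-1.0], [30.0]])
labels = tf.constant([[1.0], [0.0], [0.0]])

# from_logits=True folds the sigmoid into the loss using a numerically
# stable fused formula, so even extreme logits are handled exactly.
stable_loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
print(stable_loss(labels, logits).numpy())

# The alternative: apply sigmoid yourself and pass probabilities. Here
# sigmoid(30.0) rounds to 1.0 in float32, so Keras has to clip before
# taking the log, and the loss for that example gets distorted.
probs = tf.sigmoid(logits)
naive_loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)
print(naive_loss(labels, probs).numpy())
```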
Although now that you mention it, this is a multi-class problem, so maybe the loss function should actually be categorical_crossentropy, in which case it would be the softmax calculation instead of sigmoid that is done internally. I can’t tell from the documentation whether the binary version is smart enough to handle either case. Of course you can think of softmax as the multi-class generalization of sigmoid, and the math is very similar. I will investigate further.
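For comparison, the multi-class version of the same pattern would look something like this (again just a toy sketch of mine, with 6 classes picked arbitrarily, not code from the assignment):

```python
import tensorflow as tf

# Toy multi-class example: one sample, 6 classes, raw logits (i.e. Z3
# straight from the linear layer, no softmax applied).
logits = tf.constant([[1.2, 0.3, -0.5, 2.1, 0.0, -1.0]])
labels = tf.constant([[0., 0., 0., 1., 0., 0.]])  # one-hot true label

# Same idea as the binary case: from_logits=True folds the softmax into
# the loss computation for numerical stability.
loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
print(loss_fn(labels, logits).numpy())
```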
I was wondering about this too, @paulinpaloalto. categorical_crossentropy seemed like the obvious choice. binary_crossentropy works, but it would’ve made more sense had this been a multi-label classification problem.
I guess what’s important is that @nikolafuse’s intuition is correct, and that the reason the last activation (be it softmax or sigmoid) doesn’t have to be explicitly computed has been clearly explained.
Update on this issue: in the notebook, they don’t give you any logic to assess the results of the training, but I went ahead and cooked up my own version of that. It turns out that the training works fine and you get good prediction accuracy on both the train and test sets, but you really need to use Adam optimization (as they say in the instructions) as opposed to the SGD optimizer that they actually gave us in the code as written. I conclude from this that the binary_crossentropy logic in TF is smart enough to handle the fact that this is really not a binary classification problem, and everything just works as expected.
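In case it helps anyone trying the same experiment, here’s a minimal sketch of the kind of accuracy check I mean, using made-up logits rather than the actual notebook variables:

```python
import numpy as np
import tensorflow as tf

# Made-up logits for 3 examples over 6 classes, as if returned by forward
# prop, with true integer labels [3, 0, 5].
Z3 = tf.constant([[0.1, -0.2, 0.0, 2.3, 0.4, -1.0],
                  [1.9,  0.2, 0.1, 0.0, 0.3,  0.2],
                  [0.0,  0.5, 0.3, 0.1, 0.2,  2.8]])
labels = np.array([3, 0, 5])

# No need to apply softmax before predicting: softmax is monotonic, so the
# argmax of the logits is already the argmax of the probabilities.
preds = tf.argmax(Z3, axis=1).numpy()
print("accuracy:", np.mean(preds == labels))  # 1.0 for this toy data
```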