Why do we just compute A1 and A2 with the relu function, then return Z3 from the function and compute its cost? Why not compute A3 with a *softmax classifier*, since we are predicting multi-class labels? Thanks in advance.

That's because they want you to use the *Binary_crossentropy* loss function with the *from_logits = True* argument, which causes the *sigmoid* calculation to be incorporated into the loss computation. That is preferred because TF can manage numerical stability better when it does the two together, e.g. when dealing with problems caused by "saturated" *sigmoid* output values. Here's the doc page for Binary_crossentropy.
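To make the "do the two together" point concrete, here's a minimal sketch in plain Python (illustrative only, not the actual TF implementation) of why computing the loss directly from logits is more stable than applying *sigmoid* first and then taking the log:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def naive_bce(z, y):
    # apply sigmoid first, then cross-entropy: breaks when sigmoid saturates,
    # because 1 - sigmoid(z) rounds to exactly 0.0 in float64
    a = sigmoid(z)
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))

def stable_bce_from_logits(z, y):
    # algebraically equivalent formula that never takes the log of a
    # saturated value (this is the standard "from logits" rearrangement)
    return max(z, 0) - z * y + math.log(1.0 + math.exp(-abs(z)))

# a "saturated" logit: sigmoid(40) rounds to exactly 1.0 in float64
print(stable_bce_from_logits(40.0, 0.0))   # correct loss, about 40.0
try:
    print(naive_bce(40.0, 0.0))
except ValueError:
    print("naive version fails: log(0)")
```

The rearranged formula is the same cross-entropy, just written so the dangerous `log(1 - a)` term never appears.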

They don't really explain what the *from_logits = True* argument does in the assignment, but they do literally write out the code for you in the instructions. Maybe it would help if they explained it a bit more. I'll suggest that.

Although now that you mention it, this is a multi-class problem, so maybe the loss function should actually be *categorical_crossentropy*, in which case it would be the *softmax* calculation instead of *sigmoid* that is being done internally. I can't tell from the documentation whether the binary version is smart enough to handle either case. Of course you can think of *softmax* as the multi-class generalization of *sigmoid*, and the math is very similar. I will investigate further.
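As a quick sanity check on the "softmax generalizes sigmoid" point, here's a small plain-Python sketch (illustrative, not TF code) showing that a two-class *softmax* over logits `[z, 0]` reproduces `sigmoid(z)`:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

# two-class softmax over logits [z, 0] reduces to sigmoid(z):
# e^z / (e^z + e^0) = 1 / (1 + e^-z)
z = 1.7
print(softmax([z, 0.0])[0], sigmoid(z))   # the two probabilities agree
```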

I was wondering about this too, @paulinpaloalto. *categorical_crossentropy* seemed like the obvious choice. *binary_crossentropy* works, but it would've made more sense had this been a multi-label classification problem.

I guess what's important is that @nikolafuse's intuition is correct and the reason why the last activation (be it *softmax* or *sigmoid*) doesn't have to be explicitly computed was clearly explained.

Update on this issue: in the notebook, they don't give you any logic to assess the results of the training, but I went ahead and cooked up my own version of that. It turns out that the training works fine and you get good prediction accuracy on both the train and test sets, but you really need to use Adam optimization (as they say in the instructions) as opposed to the SGD optimizer that they actually gave us in the code as written. I conclude from this that the *binary_crossentropy* logic in TF is smart enough to handle the fact that this is really not a binary classification and everything just works as expected.
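For reference, the assessment logic I cooked up amounted to something like this (a pure-Python sketch with made-up helper names, not the notebook's code): since *softmax* is monotonic, you can take the argmax of the raw logits directly and compare against the labels:

```python
def predict_classes(logits_batch):
    # argmax over logits: softmax is monotonic, so applying it
    # would not change which class wins
    return [max(range(len(z)), key=lambda i: z[i]) for z in logits_batch]

def accuracy(logits_batch, labels):
    preds = predict_classes(logits_batch)
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

# toy example: three samples, three classes
logits = [[2.0, -1.0, 0.5], [0.1, 3.0, -2.0], [0.0, 0.2, 0.1]]
labels = [0, 1, 2]
print(accuracy(logits, labels))   # last sample is misclassified, so 2/3
```

You'd run the trained model on the train and test sets to get the logits, then feed them through something like this.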