It’s an interesting point! Have a look at the documentation for the different “cross entropy” loss functions in TF and read the description of the from_logits parameter. Here’s the docpage for binary cross entropy and here’s the categorical case for multiclass networks. What you will find is that the TF loss functions give you the option to pass in either the linear activation outputs (Z3 in this example) or the full outputs after the activation function has been applied (A3). The reason they give this option is that it turns out to be both simpler (less code for you to write) and more numerically stable to implement the activation function and the loss calculation together. One clear example of why that helps is the case of “saturated” sigmoid or softmax output values. In floating point, the output of sigmoid can round to exactly 0 or 1, even though mathematically it never actually reaches 0 or 1. If you look at the loss formula, you can see why that is a problem: it contains log(a) and log(1 - a), so a saturated output produces log(0) and the cost ends up as Inf or NaN.
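Here is a minimal sketch of that saturation problem, using one hypothetical, very confident logit value (the names y, z and a below are just placeholders for illustration, not variables from the assignment):

```python
import tensorflow as tf

# Hypothetical example: one very confident (and correct) prediction.
y = tf.constant([[1.0]])      # true label
z = tf.constant([[100.0]])    # logit (the "Z3" value)
a = tf.sigmoid(z)             # rounds to exactly 1.0 in float32 (the "A3" value)

# Naive loss formula applied to the saturated activation:
# the (1 - y) * log(1 - a) term becomes 0 * log(0) = NaN.
naive = -(y * tf.math.log(a) + (1.0 - y) * tf.math.log(1.0 - a))
print(naive.numpy())          # [[nan]]

# Passing the raw logit with from_logits=True lets TF combine the sigmoid
# and the log into one numerically stable computation.
stable = tf.keras.losses.BinaryCrossentropy(from_logits=True)(y, z)
print(stable.numpy())         # ~0.0, the mathematically correct loss
```

With from_logits=True, TF never has to take the log of a rounded probability: it can work directly with the logit, so the saturated case stays finite.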
I forget whether Prof Ng ever explains this anywhere in the lectures, but what you will find is that we always use the “from_logits = True” mode in these courses. It’s less code for us to write and it works better, so what’s not to like about that?
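For concreteness, here is a sketch of what that pattern typically looks like (the layer sizes and names here are hypothetical, not taken from any particular exercise): the output layer uses a linear activation so the model emits logits, and the loss is told it is receiving logits.

```python
import tensorflow as tf

# Hypothetical 3-layer classifier: the output layer has no activation,
# so the model produces logits (the Z3 values), not probabilities.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(15, activation="relu"),
    tf.keras.layers.Dense(10),  # linear output layer -> logits
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# If you need actual probabilities at prediction time, apply the
# activation yourself to the logits the model outputs, e.g.:
# probs = tf.nn.softmax(model(X_new))
```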