You are right that the outputs are Z3, the “logits”, meaning the linear activation outputs of the last layer, as opposed to the full activation outputs (A3). But that is not the reason for the transpose: the activation functions are always applied “elementwise”, so they don’t change the dimensions or orientation of the data. It just turns out that the network was defined to take data oriented as n_x x m, where n_x is the number of features and m is the number of samples. That’s the way Prof Ng chose to orient the data in Course 1 and earlier in Course 2. But the TensorFlow functions we are now switching to assume that the “samples” dimension is the first dimension. That is why you have to do the transpose to get m as the first dimension. They mention this in the instructions for the compute_cost section of the assignment.
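Here’s a minimal sketch of what that looks like, assuming (as in Course 1) that both Z3 and Y come out of forward prop oriented features-first, shape (n_y, m). The shapes and variable names here are just illustrative, not the assignment’s exact code:

```python
import tensorflow as tf

# Sketch only: assume Z3 (logits) and Y (labels) are oriented features-first,
# i.e. shape (n_y, m), the Course 1 convention.
n_y, m = 6, 32
Z3 = tf.random.normal((n_y, m))                                   # logits, (n_y, m)
Y = tf.transpose(
    tf.one_hot(tf.random.uniform((m,), maxval=n_y, dtype=tf.int32), n_y)
)                                                                 # labels, (n_y, m)

# The TF loss expects the samples dimension first, so transpose both to
# (m, n_y) before computing the cross entropy. from_logits=True means Z3
# stays un-activated (no softmax applied in the last layer).
cost = tf.reduce_mean(
    tf.keras.losses.categorical_crossentropy(
        tf.transpose(Y), tf.transpose(Z3), from_logits=True
    )
)
print(cost.numpy())
```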
Note that the reason the network outputs the logits instead of the activation outputs is that Prof Ng chose to use the from_logits = True mode of the various cross entropy loss functions here. This is the way it will be whenever we’re using TF. Here’s another recent thread which discusses that.
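If it helps, here’s a small hedged example (the numbers are made up) showing that passing logits with from_logits=True gives the same loss as applying the softmax yourself and passing the resulting probabilities, while being more numerically stable:

```python
import tensorflow as tf

# Hypothetical logits for 3 samples and 4 classes (samples-first orientation).
logits = tf.constant([[2.0, 1.0, 0.1, -1.0],
                      [0.5, 2.5, 0.3,  0.0],
                      [1.2, 0.7, 3.0,  0.4]])
labels = tf.one_hot([0, 1, 2], depth=4)

# With from_logits=True the loss applies the softmax internally,
# so the last layer of the network stays linear (outputs Z3, not A3).
loss_from_logits = tf.keras.losses.CategoricalCrossentropy(from_logits=True)(labels, logits)

# Equivalent in value, but you have to apply the softmax yourself first.
loss_from_probs = tf.keras.losses.CategoricalCrossentropy(from_logits=False)(
    labels, tf.nn.softmax(logits)
)

print(loss_from_logits.numpy(), loss_from_probs.numpy())
```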