For the Week 3 programming assignment, why are we not applying the softmax activation function on the output layer in the model method for every minibatch? This is the code where I am having the confusion. Why are we computing the loss before applying softmax on Z3?
# 1. predict
Z3 = forward_propagation(tf.transpose(minibatch_X), parameters)
# 2. loss
minibatch_cost = compute_cost(Z3, tf.transpose(minibatch_Y))
You’re right that we don’t apply the activation on the output layer in forward propagation. This is a very standard way to do things: the sigmoid or softmax output activation is computed as part of the cost computation instead. There is an argument, from_logits, that tells the loss function to do that; see the documentation for the loss function, which is linked in the instructions for that portion of the assignment. The reason is that it is both more efficient and more numerically stable to do it this way: for example, the loss function can handle the “saturation” cases (where the softmax output is extremely close to 0 or 1) without overflow or underflow, among other things. You’ll find that Prof Ng uses this method in all cases in which we use TF to implement and train models. It’s less code and it works better, so what’s not to like about that?
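To make this concrete, here is a small sketch (not the assignment’s actual compute_cost, and with made-up logits and labels) showing that passing raw logits with from_logits=True gives the same loss as applying softmax first, while letting TF use the more stable fused computation internally:

```python
import tensorflow as tf

# Hypothetical logits for a minibatch of 2 examples and 3 classes
# (rows = examples, columns = classes); values are illustrative only.
logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 0.3]])
labels = tf.constant([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])

# Preferred: pass raw logits and let the loss apply softmax internally.
loss_from_logits = tf.reduce_mean(
    tf.keras.losses.categorical_crossentropy(labels, logits,
                                             from_logits=True))

# Equivalent in exact arithmetic, but less numerically stable:
# apply softmax yourself, then compute cross-entropy on probabilities.
probs = tf.nn.softmax(logits)
loss_from_probs = tf.reduce_mean(
    tf.keras.losses.categorical_crossentropy(labels, probs,
                                             from_logits=False))

print(float(loss_from_logits))
print(float(loss_from_probs))
```

For these small logits the two values agree to several decimal places; with large-magnitude logits, the from_logits=True path avoids the overflow/underflow that computing softmax explicitly can cause.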