In Exercise 6 (compute_cost), the instruction is as follows: It's important to note that the "y_pred" and "y_true" inputs of tf.keras.losses.categorical_crossentropy are expected to be of shape (number of examples, num_classes).
When I write the code as instructed using tf.keras.losses.categorical_crossentropy, it fails the test, but when I use tf.nn.softmax_cross_entropy_with_logits instead, I get the correct answers.
Can you please tell me why this happens, or report/correct it if it is a bug?
I took the transpose of the logits using tf.transpose.
I get 0.8 as the output cost, but the correct value should be 0.4.
Can you please give me some hints if I did something wrong?
Thanks in advance.
That’s correct, but if you are passing the logits value, then you also have to pass the correct value for the from_logits parameter. Did you read that section of the documentation as @nramon suggested? Using the default value for from_logits will not work.
It sounds like we are talking at cross purposes here. Here’s the key point that seems like it’s not getting across: there are two ways to invoke either the binary cross entropy loss function or the categorical cross entropy loss function. You can pass the logits as the prediction input or you can pass the actual activation outputs (sigmoid or softmax) as the prediction inputs. But the algorithm can’t tell which you are doing, right? You have to tell it which form you are using as the predictions. You tell it that using the parameter from_logits, which is an optional “named” parameter with the default value of False. If you are passing logits, then you need to set the from_logits flag to True.
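To make the role of the flag concrete, here is a minimal sketch in plain Python rather than TensorFlow. The helper categorical_crossentropy below is hypothetical and only mimics the meaning of from_logits; the real tf.keras.losses.categorical_crossentropy(y_true, y_pred, from_logits=True) does additional work that this sketch leaves out, and the label and logit values are made up for illustration.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating so exp() cannot overflow.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_pred, from_logits=False):
    # Hypothetical helper that mimics the meaning of the from_logits flag.
    # from_logits=True : y_pred holds raw scores, so softmax is applied here.
    # from_logits=False: y_pred is assumed to already be probabilities.
    probs = softmax(y_pred) if from_logits else y_pred
    return -sum(t * math.log(p) for t, p in zip(y_true, probs) if t > 0)

y_true = [0.0, 1.0, 0.0]   # one-hot label for class 1 (made-up example)
logits = [2.0, 0.5, 0.1]   # raw outputs of the last linear layer (made-up)

# Two consistent ways to call it: logits with the flag set, or probabilities without it.
loss_from_logits = categorical_crossentropy(y_true, logits, from_logits=True)
loss_from_probs = categorical_crossentropy(y_true, softmax(logits))

# Inconsistent call: logits passed, but the flag left at its default of False,
# so the raw scores are silently treated as if they were probabilities.
loss_wrong_flag = categorical_crossentropy(y_true, logits)

print(loss_from_logits, loss_from_probs, loss_wrong_flag)
```

The first two losses agree, while the third is a different number even though no error is raised, which is the kind of mismatch that shows up as a failed test.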
Great! You may well ask why they go to all that trouble. It turns out that computing the activation and the loss together in one step allows them to get better numerical stability and handle some outlier cases like NaN values from saturated sigmoid outputs more simply and cleanly. It’s also one less TF function call, so you’ll see that we always use the “from_logits = True” mode in these courses. If it’s one less call and it works better, what is not to like about that?
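Here is a small illustration of the stability point, again as a plain-Python sketch and not the TF implementation. It uses the binary (sigmoid) case with a made-up logit of 40 and label 0. The fused expression max(z, 0) - z*y + log(1 + exp(-|z|)) is the algebraically equivalent form given in the tf.nn.sigmoid_cross_entropy_with_logits documentation; the function names are my own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_two_step(y, z):
    # Step 1: compute the activation. Step 2: compute the loss from it.
    a = sigmoid(z)
    return -(y * math.log(a) + (1.0 - y) * math.log(1.0 - a))

def bce_from_logits(y, z):
    # Fused form working directly on the logit:
    # max(z, 0) - z*y + log(1 + exp(-|z|))
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

z, y = 40.0, 0.0   # confidently wrong prediction: large positive logit, label 0

try:
    two_step = bce_two_step(y, z)
except ValueError:
    # sigmoid(40.0) rounds to exactly 1.0 in float64, so log(1 - a) is log(0).
    two_step = None

fused = bce_from_logits(y, z)
print(two_step, fused)
```

Plain Python raises an error on log(0), where an array library would typically give you inf or NaN instead, but the cause is the same: the saturated sigmoid output has already lost the information the loss needs. The fused version never forms that rounded probability and returns a finite loss of about 40.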
Numerical stability is the reason the activation and the loss are calculated at the same time!
I used to think it was weird not to compute the softmax and the loss in separate steps, since the probability distribution over the classes was not visible. But it made sense once I read about the stability!
Also, if one wants the probability distribution, one can apply a softmax to the model's logits at any time!
Right! You need to have the logic to apply softmax to the logits when you’re using the model in “prediction” mode as opposed to “training” mode anyway.
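As a rough sketch of what that prediction-mode logic amounts to, assuming the model's last layer returns raw logits (the values below are made up, and in TF you would normally call tf.nn.softmax on the logits tensor rather than write this by hand):

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating so exp() cannot overflow.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.2, -0.3, 3.1, 0.4]   # made-up logits for a single example

# Probability distribution over the classes, recovered after the fact.
probs = softmax(logits)

# Index of the most likely class.
predicted_class = max(range(len(probs)), key=probs.__getitem__)

print(probs, predicted_class)
```

Softmax preserves the ordering of the logits, so if you only need the predicted class you get the same index by taking the argmax of the logits directly.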