Math behind "tf.keras.metrics.categorical_crossentropy"


I accidentally deleted my previous post, so I am rewriting my question. If you find the old post, feel free to delete it.

I did the math my way, but it doesn't match the final result. Can someone guide me here? Thank you.

Q1: In those (2, 6) matrices there are 2 samples and 6 classes. Is my understanding right?

Q2: I think the cost is J = (1/2) * (-1 * log(6.148) - 1 * log(5.033)), but this does not equal the expected solution 0.810287. I think I do not understand this part.

Q3: The documentation for tf.keras.metrics.categorical_crossentropy says: "from_logits: Whether y_pred is expected to be a logits tensor. By default, we assume that y_pred encodes a probability distribution." What is the meaning of "encodes a probability distribution"? Thanks.

The problem here is precisely that our "logits" input is not a probability distribution; by "a probability distribution" they mean the output of the softmax activation function. Look at how we built the forward propagation in the earlier part of the exercise: there is no activation function at the output layer. That's why you get values like 6.xxx and 5.xxxx, which give you wildly wrong results if you apply the cross-entropy loss to them directly.
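Here's a minimal sketch of what "encodes a probability distribution" means, using made-up logit values in roughly the same range as yours (not the actual numbers from the exercise):

```python
import tensorflow as tf

# Made-up raw logits (no activation at the output layer), shape (2, 6):
# 2 samples, 6 classes. Values like 6.x are fine here because they are
# not meant to be probabilities yet.
logits = tf.constant([[6.1, 1.2, 0.3, -0.5, 2.0, 0.7],
                      [5.0, 0.1, 1.5,  0.2, 0.9, 3.1]])

# Softmax turns each row into a probability distribution: every entry is
# in (0, 1) and each row sums to 1. That is what the docs mean by
# "encodes a probability distribution".
probs = tf.nn.softmax(logits, axis=-1)

print(probs.numpy())
print(tf.reduce_sum(probs, axis=-1).numpy())  # ~[1. 1.]
```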

That's why we need to tell the loss function that the inputs are "logits" and not a probability distribution, and the from_logits argument is how you do that. Prof Ng does not really discuss this as I recall, but the reason for doing it this way is that it is both more efficient and more numerically stable to let the loss function do the softmax (or sigmoid in the binary case) and the log loss computation as a "bundled" operation. For example, it is easy to handle the "saturation" case in which the output rounds to exactly 0 or 1. That never happens mathematically, but in floating point it can, and it makes a mess, since the loss ends up being NaN in that case. Once we switch to using TensorFlow, Prof Ng always does it this way, meaning we never include the activation function at the output layer in a classification problem.
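As a rough illustration of the two ways of calling the loss (again with made-up labels and logits, not the exercise's values):

```python
import tensorflow as tf

# Made-up one-hot labels and raw logits, shape (samples, classes) = (2, 6).
labels = tf.constant([[1., 0., 0., 0., 0., 0.],
                      [0., 0., 0., 0., 0., 1.]])
logits = tf.constant([[6.1, 1.2, 0.3, -0.5, 2.0, 0.7],
                      [5.0, 0.1, 1.5,  0.2, 0.9, 3.1]])

# Bundled: the loss applies softmax internally (more numerically stable).
loss_bundled = tf.keras.losses.categorical_crossentropy(
    labels, logits, from_logits=True)

# Unbundled: apply softmax yourself and pass probabilities.
probs = tf.nn.softmax(logits, axis=-1)
loss_unbundled = tf.keras.losses.categorical_crossentropy(labels, probs)

# The two agree for moderate logits; the bundled form avoids log(0) -> NaN
# when the softmax saturates.
print(loss_bundled.numpy())
print(loss_unbundled.numpy())
```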

Hi Paul, after applying softmax I hand-calculated the loss and got loss = [0.25361034 0.5566767], which is exactly the same value I get from my code:

"tf.keras.losses.categorical_crossentropy(tf.transpose(labels), tf.transpose(logits), from_logits=True)",

However, in the picture below you see loss = [0.25361034 0.5566767], and then cost = tf.reduce_sum(loss) = 0.25361034 + 0.5566767 = 0.810287, which is the final result. I thought I should compute (1/m) * sum(loss) = (1/2) * sum(loss), but I don't see the 1/2 there.
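To make the difference concrete, here is a tiny check using just the two per-sample losses from above:

```python
import tensorflow as tf

# The two per-sample losses from above.
loss = tf.constant([0.25361034, 0.5566767])

cost_sum = tf.reduce_sum(loss)    # 0.25361034 + 0.5566767 ~= 0.810287
cost_mean = tf.reduce_mean(loss)  # includes the 1/m = 1/2 factor ~= 0.405144

print(cost_sum.numpy(), cost_mean.numpy())
```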

In the tester function I do see "assert (np.abs(result - (0.50722074 + 1.1133534) / 2.0) < 1e-7)". Does this tell me my code for the loss is still wrong? Thank you.


They updated this notebook recently so that it manages the cost in the way needed for handling minibatches, as we did in the optimization assignment in Week 2. The compute_cost function returns the sum of the cost across the given samples. You accumulate that total over all the minibatches and only divide by m at the end of the epoch. Check the logic in the model function that comes next to see what I mean.
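Roughly, the pattern looks like this (the names and data here are illustrative stand-ins for the notebook's actual functions, not the real code):

```python
import tensorflow as tf

# Hypothetical minibatches: each provides raw logits and one-hot labels
# of shape (batch, classes).
minibatches = [
    (tf.random.normal((2, 6)), tf.one_hot([0, 5], depth=6)),
    (tf.random.normal((2, 6)), tf.one_hot([1, 3], depth=6)),
]
m = 4  # total number of training samples across all minibatches

def compute_cost(logits, labels):
    # Sum (not mean) of the per-sample losses for this minibatch.
    return tf.reduce_sum(
        tf.keras.losses.categorical_crossentropy(labels, logits, from_logits=True))

epoch_total_cost = 0.0
for logits, labels in minibatches:
    epoch_total_cost += compute_cost(logits, labels)

# Divide by m only once, at the end of the epoch.
epoch_cost = epoch_total_cost / m
print(float(epoch_cost))
```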

So there must be something else wrong, although your code as shown looks right to me. I have not had time to update to the new notebook yet, so I probably can't really help until I get back from vacation. At least at the rate I'm going at this point …

No worries. Have a great vacation.