I'm using the following cost function while building my first neural network in TensorFlow:
cost = tf.reduce_sum(tf.keras.losses.categorical_crossentropy(tf.transpose(logits), tf.transpose(labels)))
But it does not match with the test values:
What I get: tf.Tensor(226.88997, shape=(), dtype=float32)
Expected output: tf.Tensor(0.810287, shape=(), dtype=float32)
I am not sure what I am implementing wrong, can anyone help?
Additional notes: This is from the W3 assignment of [Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization](https://www.coursera.org/learn/deep-neural-network/home/welcome).
I actually did spot something. I am not sure if it is an error, and if it is not an error, then I guess I am missing something.
In the programming assignment of course 2, week 3, where we implement the cost function, the text describing the assignment notes that the inputs to tf.keras.losses.categorical_crossentropy are expected to be of shape (number of examples, num_classes), but the output of Z3 is of shape (number of classes, number of samples). I think the appropriate shape for both labels and logits should be (number of classes, number of samples).
Have you added reduce_sum? Also note that labels come before logits, i.e. tf.keras.losses.categorical_crossentropy(y_true, y_pred) and NOT tf.keras.losses.categorical_crossentropy(y_pred, y_true).
Thanks! Yes, I tried it just now. The output is closer to the expected output, but not a 100% match yet.
cost = tf.reduce_sum(tf.keras.losses.categorical_crossentropy(labels, logits))
It’s the same for me too. I guess I just had to submit it to the grader and make do with 80%… I already opened a topic to ask whether the grader has an error or I am missing something; you could upvote the question…
I answered on your other thread about this. There are two problems (once you solved the issue with the order of the arguments):
You do actually need those transposes.
You also need to tell the cost function that the “logits” input is raw linear outputs and not softmax outputs. That is done by using the from_logits argument to the cost function. The default value of that optional parameter is False, but that is not the appropriate value in this case.
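For reference, here is a minimal sketch of what the call looks like with both fixes applied, assuming logits and labels both have shape (num_classes, num_examples) as in the assignment (variable names are only illustrative):

```python
import tensorflow as tf

# Sketch only: logits and labels are assumed to have shape
# (num_classes, num_examples), i.e. one example per column.
cost = tf.reduce_sum(
    tf.keras.losses.categorical_crossentropy(
        tf.transpose(labels),   # y_true comes first
        tf.transpose(logits),   # y_pred comes second
        from_logits=True        # inputs are raw linear outputs, not softmax
    )
)
```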
As explained in the DocString, let’s keep in mind that the shapes of logits and labels are both (6, num_examples), which means one sample is represented by one column.
However, by default, tf.keras.losses.categorical_crossentropy requires that in logits and labels one sample is represented by one row. Making such a rearrangement can only be done with transpose. Also, switching the “meaning” of the axes (from “1 column = 1 sample” to “1 row = 1 sample”) is exactly what transpose does, while reshape isn’t designed for that - it really just reshapes. I suggest you make a simple 3 x 2 logits array and run transpose and reshape on it to see their different effects.
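To make that concrete, here is a small experiment you could run (the values are made up just for illustration):

```python
import tensorflow as tf

# A toy "logits" array: 3 classes x 2 examples, so each column is one example.
a = tf.constant([[1., 2.],
                 [3., 4.],
                 [5., 6.]])

print(tf.transpose(a))
# [[1. 3. 5.]
#  [2. 4. 6.]]  <- columns become rows: each row is now one complete example

print(tf.reshape(a, (2, 3)))
# [[1. 2. 3.]
#  [4. 5. 6.]]  <- same storage order, just re-chunked: the examples get mixed up
```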
It is interesting that you disable from_logits and use the tf.keras.activations.softmax function; however, softmax also assumes its input has one sample per row. I guess you know what change is needed here? Note that enabling from_logits is better for numerical accuracy, so you should expect a slightly different output when you disable it. That difference may fail some tests.
@Mubsi, in C2 W3 Assignment function compute_cost, I think we should use tf.reduce_mean to calculate the cost (averaged logistic loss over m samples). What do you think? @Isobe_Atsushi is trying to implement the cost formula with the 1/m term.
They recently updated this assignment and switched from using the mean of the cost values at this level to the sum. The reason is that they wanted to make this work correctly for the minibatch case: compute_cost is invoked on each minibatch and it’s more correct to compute the sum of the costs at that level. You accumulate the sum over all the minibatches and then at the end of the full Epoch, you divide by m to get the overall mean of the cost values. If you compute the average at the level of the minibatches, you can’t just take the “average of the averages” to get the overall average: if you think that through, what you realize is that the math only works in the case that all the minibatches are the same size. But that is not guaranteed, right? If the minibatch size does not evenly divide m, then the last minibatch is smaller. Granted this would be a small deviation in the grand scheme of things, but it’s still incorrect.
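Here is a small made-up numeric example of why the “average of the averages” goes wrong when the last minibatch is smaller:

```python
import numpy as np

# Hypothetical per-example losses for m = 5 examples, split into
# minibatches of size 3 and 2 (the last batch is smaller).
losses = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
batch1, batch2 = losses[:3], losses[3:]

true_mean = losses.mean()                                      # 0.6
avg_of_avgs = (batch1.mean() + batch2.mean()) / 2              # 0.65 (wrong)
sum_then_divide = (batch1.sum() + batch2.sum()) / len(losses)  # 0.6  (right)

print(true_mean, avg_of_avgs, sum_then_divide)
```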
If you look at how the cost is handled in the C2 Week 2 Optimization Assignment, you’ll see this whole idea written out in full detail in how they handle the costs there: compute_cost returns the sum and then the average is computed at the end of each Epoch of training.
@Isobe_Atsushi, I think Paul’s reply is the last piece of puzzle for your 2nd question.
Exactly!
My initial thought was more to align with the cost value displayed in TensorFlow’s model training progress - it just adds up each minibatch’s cost and divides the total by the number of batches, disregarding the actual minibatch sizes. That’s why I didn’t see it as a problem, even though it isn’t guaranteed to be the correct full-training-set cost value. But we want it to be correct, and that’s the point, right?
Indeed, I could have guessed the motivation for using reduce_sum if I had checked the code line that divides the total by m.
For me, not using the from_logits flag and sticking to Andrew’s explanation of the cost function,
J = (1/m) * Σ L(ŷ, y) for categorical cross-entropy,
is more understandable. The error is now < 10^(-7).
So, what is the correct answer? Do we need a fix to the notebook?
You can manually do the softmax in your compute_cost logic and use from_logits = False and that is logically correct. But it turns out you get a slightly different answer because of rounding behavior, which causes the grader to fail your code. The point of from_logits = True mode is that it is more “numerically stable”. That’s actually a “term of art” in Numerical Analysis with a well defined meaning, not just some metaphorical “hand-waving”.
You also need to do the sum and not the mean at this level, as I explained above.
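Here is a small standalone comparison you could try (the toy labels and logits are invented for illustration), showing that the manual-softmax route agrees with from_logits=True only up to floating-point rounding:

```python
import tensorflow as tf

# Toy check (not the notebook's code): both routes compute the same loss
# mathematically, but from_logits=True is the more numerically stable path.
labels = tf.constant([[0., 1., 0.],
                      [1., 0., 0.]])
logits = tf.constant([[2.0, 5.0, -1.0],
                      [4.0, 1.0,  0.5]])

loss_from_logits = tf.keras.losses.categorical_crossentropy(
    labels, logits, from_logits=True)

loss_manual_softmax = tf.keras.losses.categorical_crossentropy(
    labels, tf.keras.activations.softmax(logits), from_logits=False)

print(loss_from_logits.numpy())      # per-example losses
print(loss_manual_softmax.numpy())   # nearly the same, up to float rounding
```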
Please keep in mind that, as the name of the assignment (“Tensorflow_introduction”) suggests, the idea is to use and get familiar with TF APIs, not the traditional formulas.
@Mubsi
That’s a good point! Anyway I passed the assignment.
I learned that the TF API and the traditional formula do the same thing; the TF API just does it a little better.
Under the hood, TF APIs are more or less the same as the traditional formulas. The convenience is that, instead of remembering and typing those, you can just use an API.