I have been having trouble understanding what I need to do for Exercise 6 of Week 3 (computing the loss function: compute_total_loss) using TensorFlow. I thought I would simply need to invoke the TF cross-entropy function, compute the reduce_sum, and return the total loss. Below is the code fragment:
{moderator edit - solution code removed}
That didn’t seem to help. Reading through the TF docs, it seemed like the cross-entropy function expected the logits to be predictions. Upon inspecting the inputs to compute_total_loss, which seemed like plain numbers, I thought I could use the sigmoid or softmax functions in TF and pass their output as input to the cross-entropy function (like below).
{moderator edit - solution code removed}
Neither passing in the softmax output nor the sigmoid output in place of the logits seemed to work. I also tried setting the from_logits property of the cross-entropy function to false/true, but that didn’t seem to help either. I keep getting a total_loss of
tf.Tensor(0.88275003, shape=(), dtype=float32)
while the expected value is
tf.Tensor(0.810287, shape=(), dtype=float32)
I am kind of stuck and am not sure how to proceed. Any help would be appreciated.
The key point is that the output of layer 3 is the “logits”, not the softmax output. So you need to include the softmax in the calculation. There are (as you say) two ways to do that:
Manually add softmax.
Use from_logits = True to include the softmax in the loss calculation.
Option 2) is the preferred method because it’s less code for you to write and more numerically stable.
There is one other thing you didn’t mention here: you also need to transpose the labels and logits to get the correct answer. This was discussed in the instructions.
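To make those two points concrete, here is a minimal sketch (toy values and illustrative variable names, not the graded solution) of how from_logits=True and the transpose fit together with tf.keras.losses.categorical_crossentropy:

```python
import tensorflow as tf

# Toy "logits" and one-hot labels in the course's orientation: (num_classes, num_examples)
logits = tf.constant([[2.0, 0.5], [0.1, 1.5], [0.3, 0.2]])   # shape (3, 2)
labels = tf.constant([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # shape (3, 2)

# Option 2 (preferred): let the loss function apply the softmax internally.
# categorical_crossentropy expects shape (num_examples, num_classes), hence tf.transpose.
per_example = tf.keras.losses.categorical_crossentropy(
    tf.transpose(labels), tf.transpose(logits), from_logits=True)
total_loss = tf.reduce_sum(per_example)

# Option 1 (equivalent, but less numerically stable): apply the softmax yourself
# over the class dimension and leave from_logits at its default of False.
probs = tf.nn.softmax(logits, axis=0)
per_example_manual = tf.keras.losses.categorical_crossentropy(
    tf.transpose(labels), tf.transpose(probs))

print(total_loss, tf.reduce_sum(per_example_manual))  # the two sums should match
```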
Thank you so much for the response. I had totally missed the part that I needed to transpose both the labels and the logits. I followed the ideas you provided and that did it! Thank you so much!
Seems like a candidate for more than “1 line of code.” To be fair, it could be one line of code; it would just be a long line.
Also, in my lab the documentation link points to tf.keras.metrics.categorical_crossentropy while the text says tf.keras.losses.categorical_crossentropy … not sure if that makes any difference, though. It seems like metrics is actually the one being linked.
As to the number of code lines, that is a programming style point and those are always just suggestions. The grader never looks at your actual code: it just calls your functions and checks the output values. I totally agree that clarity and maintainability may well be better served by more rather than fewer lines, provided that they are appropriately “tasteful”.
On the question of which loss function to use, I think they are just two APIs for the same underlying function. The OOP is getting pretty thick here with more subclasses than you can shake a proverbial stick at. Here’s the docpage for the other one. There is one additional argument provided by the “losses” version, but it’s not anything we care about. Either one should work.
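If you want to convince yourself, a quick check (just an illustration; keyword arguments beyond from_logits may differ slightly between TF versions) is to call both names on the same inputs:

```python
import tensorflow as tf

labels = tf.constant([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # (num_examples, num_classes)
logits = tf.constant([[2.0, 0.1, 0.3], [0.5, 1.5, 0.2]])

loss_version   = tf.keras.losses.categorical_crossentropy(labels, logits, from_logits=True)
metric_version = tf.keras.metrics.categorical_crossentropy(labels, logits, from_logits=True)

# Both names resolve to the same underlying computation, so the values should match.
print(loss_version.numpy(), metric_version.numpy())
```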
Implement the total loss function below. You will use it to compute the total loss of a batch of samples. With this convenient function, you can sum the losses across many batches, and divide the sum by the total number of samples to get the cost value.
Here, “divide the sum by the total number of samples” is misleading.
What they say is correct if you think carefully about what is being said. We only divide by the total number of samples at the end of one full pass of training (all the minibatches). But the function we are writing here computes the cost for one minibatch, so we only take the sum. The higher-level logic keeps a running sum across all the minibatches and computes the average when the pass is finished. You can’t compute the average at the minibatch level, because the math doesn’t work if the minibatches are not all the same size, which happens whenever the minibatch size does not evenly divide the total number of samples. So you can’t get the overall average by taking the average of the averages.
If you were paying close attention, this is exactly how it worked when we first implemented minibatch gradient descent in the previous assignment (C2 W2 A1 Optimization). It’s the same here, but now we’re doing it in TF instead of straight numpy.
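A tiny numeric example (toy loss values, not from the notebook) shows why the average of per-minibatch averages is not the same as the overall average when the minibatches have unequal sizes:

```python
import tensorflow as tf

# Toy per-example losses for 5 samples split into minibatches of size 2, 2, 1.
minibatch_losses = [tf.constant([0.3, 0.7]), tf.constant([0.5, 0.1]), tf.constant([0.9])]

# Correct: keep a running SUM across minibatches, divide once by the total sample count.
total = sum(float(tf.reduce_sum(mb)) for mb in minibatch_losses)
num_samples = sum(int(mb.shape[0]) for mb in minibatch_losses)
print(total / num_samples)   # 0.5, the true mean over all 5 samples

# Wrong: averaging the per-minibatch averages over-weights the size-1 minibatch.
print(sum(float(tf.reduce_mean(mb)) for mb in minibatch_losses) / len(minibatch_losses))  # ~0.567
```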
Because the dimensions of the output of the forward propagation are features x samples and the TF loss function requires the orientation with samples as the first dimension. They do mention in the instructions that you need to be aware of that. It’s also never a bad idea to read the documentation for the TF functions they are advising you to use. This is the intro to TF so there is a lot to learn.
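In case it helps, a quick illustration of tf.transpose on a tensor in the course’s (features, samples) orientation (toy values, not the assignment’s data):

```python
import tensorflow as tf

# A (num_classes, num_examples) tensor, the orientation produced by forward propagation here.
logits = tf.constant([[2.0, 0.5], [0.1, 1.5], [0.3, 0.2]])
print(logits.shape)                 # (3, 2): classes x examples
print(tf.transpose(logits).shape)   # (2, 3): examples x classes, as the loss function expects
```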
Sure, since this is the very first assignment using TF, it would be nice for them to mention it or show an example. But you know that you need a transpose operation and you’re using TF, so try googling “how do I transpose a tensor in TF”. We can thank google for search and for creating TF.
I’ll file an enhancement request asking that they at least say something like “Hint: you will need the tf.transpose function for this purpose” in the section of the instructions where they mention the dimension issue.
For anyone else stuck: The first bullet in Exercise 6 is key: “[inputs of categorical_crossentropy] are expected to be of shape (number of examples, num_classes).”
For feedback: it helps that some of the earlier code uses tf.transpose, but I agree that the rest of this assignment and other assignments are usually much clearer about what to do. Perhaps add to this bullet: “you can use tf.transpose() if you need it”?