I have implemented the previous questions of the notebook correctly, but I am struggling with the cost function.

I have used the tf.keras.losses.categorical_crossentropy() function for this, assigning the right tensors to the y_true and y_pred arguments, and reshaping the input tensors with something like tf.reshape(labels, tf.transpose(tf.shape(labels))). Is this normal? I am getting an error. Another thing is that I didn’t use tf.reduce_mean() at all, as I didn’t know where to use it.

Can someone give me some tips on how to solve this?

There should be no need to use “reshape” there: just transpose the logits and the labels. The other thing to note is that because we are passing “logits” as the input, meaning no output activation has been applied, you need to set the from_logits argument of the categorical cross-entropy loss function to True to tell it that. Please consult the documentation for the loss function: they give you the link in the instructions.

Then you need to use tf.reduce_sum to produce a scalar loss value: the output of the loss function is “per sample” and you need to sum across the samples. Note that the overall cost will be the average of the costs, but this cost function is designed to be the lower level that computes the total cost per minibatch. Then you take the average once you have the sum across all the minibatches.
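To make the transpose, from_logits and reduce_sum steps concrete, here is a minimal sketch rather than the notebook's solution. It assumes the logits and labels arrive with shape (num_classes, num_examples); the function name total_loss_sketch and the toy values are made up for illustration.

```python
import tensorflow as tf

def total_loss_sketch(logits, labels):
    """Hypothetical helper: summed cross-entropy over one minibatch.

    Assumes logits and labels both have shape (num_classes, num_examples).
    """
    # The Keras loss expects (num_examples, num_classes), so transpose
    # both tensors instead of reshaping them.
    per_example = tf.keras.losses.categorical_crossentropy(
        tf.transpose(labels),   # y_true
        tf.transpose(logits),   # y_pred, raw scores with no softmax applied
        from_logits=True,
    )
    # per_example has one loss value per sample; sum them to get a scalar.
    return tf.reduce_sum(per_example)

# Toy data: 3 classes, 2 examples (one column per example).
logits = tf.constant([[2.0, 0.5],
                      [1.0, 1.5],
                      [0.1, 3.0]])
labels = tf.constant([[1.0, 0.0],
                      [0.0, 0.0],
                      [0.0, 1.0]])

total = total_loss_sketch(logits, labels)
print(total)
```

The result is a scalar tensor holding the total loss for the batch; dividing by the number of examples is left to the caller, once the sums from all minibatches have been accumulated.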

In the latest lab it asks us to use tf.reduce_sum instead of tf.reduce_mean. The code only passes when you use tf.reduce_sum and don't divide by the total number of examples. This is odd, as that's not how we learned to compute the cost function. Why is that?

Once you switch to minibatch gradient descent, it works better to keep the running sum of cost over all the samples in each batch. Once you have the total sum for the complete epoch by keeping the running sum over the batches, then you can divide by the number of samples in the full epoch to get the final average cost. If you average at the level of the individual minibatches, that doesn’t work unless all the minibatches are the same size. In the case that the minibatch size does not evenly divide the total training set, the last minibatch will be a different size, which means the average of the averages is not the same as the overall average, right?
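A small numeric illustration of that last point, using made-up per-example costs and a batch size that does not evenly divide the training set (plain Python, no TensorFlow needed):

```python
# Hypothetical per-example costs for 5 training examples. With a batch
# size of 2, the minibatches have sizes 2, 2 and 1.
costs = [1.0, 1.0, 1.0, 1.0, 6.0]
batch_size = 2
batches = [costs[i:i + batch_size] for i in range(0, len(costs), batch_size)]

# Approach 1: keep a running sum over the batches, then divide once by
# the total number of examples in the epoch.
running_sum = 0.0
for batch in batches:
    running_sum += sum(batch)
overall_average = running_sum / len(costs)

# Approach 2: average within each batch, then average those averages.
batch_means = [sum(batch) / len(batch) for batch in batches]
average_of_averages = sum(batch_means) / len(batch_means)

print(overall_average)      # 10 / 5 = 2.0
print(average_of_averages)  # (1 + 1 + 6) / 3, roughly 2.67
```

The single example in the short last batch gets the same weight as two examples in each full batch, which is why the second number is skewed.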

BTW you can see this method of handling the minibatch cost in action in the Optimization Methods assignment in Week 2 of Course 2. Note that the compute_cost function they provide in the utility function file returns the sum of the cost across the samples of the current batch.