… and that would seem to me that it is approximating the one-hot encoding:
[0, 0, 1, 0]
[1, 0, 0, 0]
and therefore CategoricalCrossentropy should be used, but in the lab SparseCategoricalCrossentropy is used instead, so I realized I don’t understand “SparseCategoricalCrossentropy vs CategoricalCrossentropy” at all.

If someone can explain the difference to me, especially how the output vectors relate to an example problem that the NN is trying to solve, that would be good. Examples would be appreciated and also maybe even a diagram?

The difference between categorical cross entropy and sparse categorical cross entropy is how you represent your labels in the dataset.
As shown in the screenshot attached, categorical cross entropy uses a one-hot encoded label, while sparse categorical cross entropy uses the index of the class as the label.
The loss computation gives identical results in both cases. For a single training example, the loss is: loss = -sum_i y_true[i] * log(y_pred[i]), where y_true is the one-hot ground truth label and y_pred is the vector of predicted class probabilities.
Consider the loss computation for the first vector under both losses, where y_true = [0, 0, 1, 0] and y_pred = [6.18e-03, 1.51e-03, 9.54e-01, 3.84e-02].
In the case of categorical cross-entropy, the loss will be: -[0 * log(6.18e-03) + 0 * log(1.51e-03) + 1 * log(9.54e-01) + 0 * log(3.84e-02)] = -log(9.54e-01).
In the case of sparse categorical cross-entropy, the loss is just the negative logarithm of the output at the ground truth index, i.e. at index 2: loss = -log(9.54e-01).
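
To check the arithmetic, here is a small framework-free sketch of both computations. The helper names categorical_loss and sparse_categorical_loss are made up for this illustration and are not the Keras implementations; they only mirror the per-example formula with the natural logarithm.

```python
import math

def categorical_loss(y_true_one_hot, y_pred):
    """Cross-entropy with a one-hot label: -sum_i y_true[i] * log(y_pred[i])."""
    return -sum(t * math.log(p) for t, p in zip(y_true_one_hot, y_pred))

def sparse_categorical_loss(class_index, y_pred):
    """Cross-entropy with an integer label: -log of the probability at that index."""
    return -math.log(y_pred[class_index])

# Predicted probabilities from the example in this thread.
y_pred = [6.18e-03, 1.51e-03, 9.54e-01, 3.84e-02]

loss_one_hot = categorical_loss([0, 0, 1, 0], y_pred)  # label given as a one-hot vector
loss_index = sparse_categorical_loss(2, y_pred)        # same label given as a class index

print(loss_one_hot, loss_index)
```

Both values come out at roughly 0.047, because every term multiplied by 0 in the one-hot version drops out and only -log(9.54e-01) remains.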

I finished week2 to see if it became clearer and I’m still not clear on a small point.

For a digit recognition NN (to recognize if an image is a 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9), we might want our final output layer to have 10 nodes, where the node with the highest output (for a particular training image) tells us which digit the NN thinks the image represents (so e.g. if the highest output comes from node 8, then the NN is saying the image is an ‘8’), and so in this case we would use SparseCategoricalCrossentropy. Is that right?

Now if we wanted to use CategoricalCrossentropy instead, would we have only one output node, where that output is actually a vector of length 10? (so if that one output node produced [0, 0, 0, 0, 0, 0, 0, 0, 1, 0] then since the 1 is at index 8, the NN is saying the image is an ‘8’).
That doesn’t seem right??

Please let me know where I’m wrong (and I’m probably not quite understanding a few things here!).

Hello @evoalg, if you have ten classes, your output layer needs 10 nodes, regardless of whether you are using CategoricalCrossentropy or SparseCategoricalCrossentropy.

If you choose to use SparseCategoricalCrossentropy, TensorFlow needs index labels. For a sample that is class 0, the label is 0.

If you choose to use CategoricalCrossentropy, TensorFlow needs one-hot representation labels. For a sample that is class 0, the label is [1, 0, 0, 0, 0, 0, 0, 0, 0, 0].
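
To make this concrete for the digit example, here is a minimal plain-Python sketch. The y_pred values are invented for illustration (they stand in for what a 10-node softmax output layer might produce for an image of an ‘8’), and to_one_hot and predicted_class are hypothetical helpers, not TensorFlow functions.

```python
NUM_CLASSES = 10

def to_one_hot(class_index, num_classes=NUM_CLASSES):
    """Turn an index label (e.g. 8) into a one-hot label of length num_classes."""
    return [1 if i == class_index else 0 for i in range(num_classes)]

def predicted_class(y_pred):
    """Index of the output node with the highest value."""
    return max(range(len(y_pred)), key=lambda i: y_pred[i])

# Made-up output of the 10 output nodes for one image (one value per node).
y_pred = [0.01, 0.01, 0.02, 0.05, 0.01, 0.02, 0.03, 0.02, 0.80, 0.03]

sparse_label = 8                         # label format for SparseCategoricalCrossentropy
one_hot_label = to_one_hot(sparse_label) # label format for CategoricalCrossentropy

print(len(y_pred))              # the network output has 10 values in both cases
print(one_hot_label)            # [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print(predicted_class(y_pred))  # 8
```

The network output has the same shape either way; only the format of the label you feed to the loss changes. In Keras, tf.keras.utils.to_categorical performs this kind of index to one-hot conversion.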