sparse_categorical_crossentropy vs. categorical_crossentropy on C2W4

I am working on the assignment for Course 2 Week 4.
I thought both of these (sparse_categorical_crossentropy vs. categorical_crossentropy) should work, but categorical_crossentropy actually causes an error in model.fit:
ValueError: Shapes (None, 1) and (None, 26) are incompatible

It seems that the 1 and 26 correspond to the number of categories, because model.fit runs when I reduce the number of outputs on the last layer to 1.
So categorical_crossentropy does not seem to work with non-binary outputs, which should not be the case.
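
Here is a minimal sketch that reproduces the same error for me (this is not the assignment code; the input shape and the 26-unit output layer are just stand-ins):

# Hypothetical minimal reproduction, not the assignment code
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(26, activation='softmax')   # 26 classes, as in the hand-sign data
])
model.compile(optimizer='adam', loss='categorical_crossentropy')

x = np.random.rand(8, 784).astype('float32')
y = np.random.randint(0, 26, size=(8, 1))             # integer labels of shape (8, 1)

model.fit(x, y)   # ValueError: Shapes (None, 1) and (None, 26) are incompatible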

Could you please give me an explanation?

Hi @Noriyuki_Kushida
Welcome to Discourse.
This topic has already been addressed in this thread.
Basically, sparse_categorical_crossentropy is suitable when the labels are integers, while categorical_crossentropy is used when the labels are one-hot encoded.
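For example, a small sketch with made-up labels for a 3-class problem (none of this is from the assignment) shows which loss matches which label format:

# Made-up 3-class example, only to show which loss goes with which label format
import numpy as np
import tensorflow as tf

y_int = np.array([2, 0, 1])                        # integer labels
y_onehot = tf.one_hot(y_int, depth=3)              # one-hot encoded labels
y_pred = np.array([[0.1, 0.2, 0.7],
                   [0.8, 0.1, 0.1],
                   [0.2, 0.6, 0.2]])

print(tf.keras.losses.SparseCategoricalCrossentropy()(y_int, y_pred).numpy())   # integer labels
print(tf.keras.losses.CategoricalCrossentropy()(y_onehot, y_pred).numpy())      # one-hot labels, same value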
Hope this helps.
Br

Thank you very much for your help.
I probably should have been a bit clearer.
I am wondering why categorical_crossentropy works in the rock-paper-scissors example but not in the hand-sign example. Both are non-binary examples, so I thought categorical_crossentropy could be used in both cases.
Hope this makes sense.

Hi @Noriyuki_Kushida
Sorry for the long delay, but I'm back.
I understand your question now.
categorical_crossentropy works fine in the rock-paper-scissors example because training_datagen.flow_from_directory has been configured with class_mode='categorical', so the labels are one-hot encoded during the ingestion step. Hence categorical_crossentropy is the right choice for that example. In the hand-sign example the labels stay as integers, so sparse_categorical_crossentropy is needed there (or the labels have to be one-hot encoded first).
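A rough sketch of the relevant generator setup (the directory path and image size here are placeholders, not the assignment values):

# 'rps/' and target_size are placeholders, not the assignment values
from tensorflow.keras.preprocessing.image import ImageDataGenerator

training_datagen = ImageDataGenerator(rescale=1./255)
train_generator = training_datagen.flow_from_directory(
    'rps/',
    target_size=(150, 150),
    class_mode='categorical'   # labels come out one-hot encoded -> use categorical_crossentropy
    # class_mode='sparse'      # labels come out as integers     -> use sparse_categorical_crossentropy
)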
Please take a look at this link.

BR

Hi,

I’ve been struggling to understand the same question for some days, and here is what I got.

About sparse categorical cross-entropy

The expression for sparse categorical cross-entropy, when tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False) is used, is:

-\frac{1}{hw}\sum_{h, w} \log \left(p_{h, w, c_{h, w}}\right)

where h is the number of rows, w is the number of columns, and c_{h, w} is the channel selected at position (h, w) by the label, which I will call the "rule value". Along the channel axis, \log \left(p_{h, w, c}\right) is taken in correspondence with the rule value. For example, say we only have two channels (c = 2) and h = w = 3. If we consider the entries p_{0, 0, 0} and p_{0, 0, 1} and the rule value is 0, it means we keep only p_{0, 0, 0}. For the entries p_{2, 1, 0} and p_{2, 1, 1}, if the rule value is 1, then we keep only p_{2, 1, 1}.

Now, let's go deeper. We keep the same dimensions for the prediction (output) matrix: 3x3x2. The probabilities across the channels at each position should sum up to 1 (otherwise, we should set from_logits = True). An example of such a matrix is:

y_pred = np.array([[[0.1, 0.9], [0.4, 0.6], [0.55, 0.45]], [[0.3, 0.7], [0.2, 0.8], [0.05, 0.95]], [[0.15, 0.85], [0.25, 0.75], [0.01, 0.99]]])
print(y_pred)

[[[0.1  0.9 ]
  [0.4  0.6 ]
  [0.55 0.45]]

 [[0.3  0.7 ]
  [0.2  0.8 ]
  [0.05 0.95]]

 [[0.15 0.85]
  [0.25 0.75]
  [0.01 0.99]]]
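
As a quick check (continuing with this y_pred), the probabilities across the two channels at every position do sum to 1:

# Sanity check: probabilities across the channel axis sum to 1 at each (h, w) position
print(np.allclose(y_pred.sum(axis=-1), 1.0))   # True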

The first layer of y_pred is:
y_pred[:, :, 0]
array([[0.1 , 0.4 , 0.55],
       [0.3 , 0.2 , 0.05],
       [0.15, 0.25, 0.01]])

And the second is:
y_pred[:, :, 1]
array([[0.9 , 0.6 , 0.45],
       [0.7 , 0.8 , 0.95],
       [0.85, 0.75, 0.99]])

The true labels could be the following:
y_true = np.array([[0, 1, 1], [1, 1, 1], [1, 0, 1]])
print(y_true)
[[0 1 1]
 [1 1 1]
 [1 0 1]]

The entries in y_true are, in fact, the rule values I mentioned before, and they relate to the 2D entries in y_pred. For example: [0, 1, 1] \rightarrow [[0.1, 0.9], [0.4, 0.6], [0.55, 0.45]], meaning that 0 \rightarrow [0.1, 0.9], 1 \rightarrow [0.4, 0.6], 1 \rightarrow [0.55, 0.45]. 0 means we take 0.1, the first 1 means we take 0.6, and the second 1 means we take 0.45. In other words, the algorithm takes one entry per position, and each entry comes from either the first or the second layer of y_pred. The rule is dictated by the entries in y_true.
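
The same selection can be done manually with NumPy, just to make the gathering explicit (an illustration only, not what Keras does internally):

# Manual version of the selection described above (illustration only):
# y_true[h, w] picks which channel of y_pred[h, w, :] is kept.
selected = np.take_along_axis(y_pred, y_true[..., np.newaxis], axis=-1)[..., 0]
print(selected)
# [[0.1  0.6  0.45]
#  [0.7  0.8  0.95]
#  [0.85 0.25 0.99]]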

We will use tf.keras.losses.SparseCategoricalCrossentropy. If we evaluate it with reduction set to NONE, we get the individual terms of the sum (the negative logarithms of the selected entries).

scce_none = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE)
scce_none(y_true,y_pred).numpy()
array([[2.30258512, 0.51082563, 0.79850769],
       [0.35667494, 0.22314355, 0.05129329],
       [0.16251893, 1.38629436, 0.01005034]])

With a trick using exponentiation, we can roll back and see what entries of the matrix are used.
np.exp(-scce_none(y_true,y_pred).numpy())
array([[0.1       , 0.59999999, 0.45      ],
       [0.7       , 0.8       , 0.95      ],
       [0.85      , 0.25      , 0.99      ]])

To get the final value of the loss, we can use tf.keras.losses.SparseCategoricalCrossentropy() with no options, or compute it manually in the following way:

1/9 * np.sum(scce_none(y_true,y_pred).numpy()) # since hw = 9
0.644654873965515

The same result is obtained using tf.keras.losses.SparseCategoricalCrossentropy(from_logits = False):

scce_tot = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = False)
scce_tot(y_true,y_pred).numpy()
0.6446548700332642
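
And, tying this back to the original question of the thread: if we one-hot encode y_true ourselves, CategoricalCrossentropy gives exactly the same number (a small sketch continuing from the arrays above):

# Same result via one-hot labels + categorical cross-entropy
y_true_onehot = tf.one_hot(y_true, depth=2)          # shape (3, 3, 2)
cce_tot = tf.keras.losses.CategoricalCrossentropy()
print(cce_tot(y_true_onehot, y_pred).numpy())        # ~0.64465, same as above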

How is the training done?

Now, let's analyze how the training is done. We go back to the actual prediction (output) of the U-Net, which is a None x 96 x 128 x 23 tensor, while the mask is None x 96 x 128 x 1; None stands for the number of examples. Let's take a single example, so 96x128x23 are the dimensions of y_pred and 96x128x1 are those of y_true. The entries in y_true are integers from 0 to 22 (23 in total).

Let's analyze the location (1, 2) across all 23 layers of y_pred. If at (1, 2) we have 5 in y_true, then the cost function will evaluate the logarithm of the entry (1, 2) on the 6th layer; if at (40, 67) we have 0 in y_true, then the cost function will evaluate the logarithm of the entry (40, 67) on the 1st layer, and so on. The first 96x128x23 values of y_pred (from the 1st forward propagation) have no "connection" with y_true, because the cost function is calculated only after y_pred has been evaluated. The selection results in a 96x128 matrix of probabilities, where each probability corresponds to an integer in the range 0...22. Then we evaluate the cost function and obtain the error.

Based on that, backpropagation will try to optimize the parameters of the network so that the error becomes smaller. For the error to become smaller, the entries selected from the 23 layers of y_pred have to tend to 1 (since they are probabilities). Since -\frac{1}{hw}\sum_{h, w} \log \left(p_{h, w, c_{h, w}}\right) is a sum of negative logarithms of numbers greater than 0 and less than or equal to 1, the result is non-negative and tends to zero as those probabilities tend to 1. This is, I think, how the training is done.
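
To make the shapes concrete, here is a shape-only sketch with random tensors standing in for the network output and the mask (the 96x128x23 and 96x128x1 sizes are taken from the assignment; everything else is made up):

# Shape-only sketch: random tensors stand in for the U-Net output and the mask
batch = 4
y_pred = tf.nn.softmax(tf.random.uniform((batch, 96, 128, 23)), axis=-1)    # probabilities over 23 classes
y_true = tf.random.uniform((batch, 96, 128, 1), maxval=23, dtype=tf.int32)  # mask of class indices

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
print(loss(tf.squeeze(y_true, axis=-1), y_pred).numpy())   # one scalar, averaged over batch*96*128 positions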

How is the prediction done?

We feed an image to the network and get a 96x128x23 y_pred matrix with probability entries. Then we prepare a 96x128 matrix by taking, at each position, the index of the maximal element along the 23 channels. This is the class the network considers most probable at that pixel (and it is the choice that would make the loss smallest if we evaluated it, which we don't do in the prediction phase). This new matrix is the mask for the image.
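
As a sketch (again with a random tensor standing in for the network output):

# Turning the probability volume into a mask: argmax over the channel axis
pred_probs = tf.nn.softmax(tf.random.uniform((96, 128, 23)), axis=-1)
pred_mask = tf.argmax(pred_probs, axis=-1)     # shape (96, 128), integer class indices 0..22
print(pred_mask.shape)                         # (96, 128)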

I hope my ideas are correct.

Henrikh
