Hi,
I've been struggling to understand the same question for some days, and here is what I got.
About sparse categorical cross-entropy
The expression for sparse categorical cross-entropy when tf.keras.losses.SparseCategoricalCrossentropy(from_logits = False) is used is:
-\frac{1}{hw}\sum_{h, w} \log \left(p_{h, w, c_{h, w}}\right)
where "h" and "w" index the rows and columns (so hw is the total number of locations) and c_{h, w} is the channel kept at location (h, w). At each location, \log \left(p_{h, w, c}\right) is taken only for the channel that corresponds to a rule value. For example, say we only have two channels (c = 2) and h = w = 3. If we consider the entries p_{0, 0, 0} and p_{0, 0, 1} and the rule value is 0, it means we have to keep only p_{0, 0, 0}. For the entries p_{2, 1, 0} and p_{2, 1, 1}, if the rule value is 1, then we have to keep only p_{2, 1, 1}.
Now, let's go deeper. We keep the same dimensions for the prediction (output) matrix - 3x3x2. At each location, the probabilities across the channels should sum up to 1 (otherwise, we should set from_logits = True). An example of such a matrix is:
import numpy as np

y_pred = np.array([[[0.1, 0.9], [0.4, 0.6], [0.55, 0.45]], [[0.3, 0.7], [0.2, 0.8], [0.05, 0.95]], [[0.15, 0.85], [0.25, 0.75], [0.01, 0.99]]])
print(y_pred)
[[[0.1 0.9 ]
[0.4 0.6 ]
[0.55 0.45]]
[[0.3 0.7 ]
[0.2 0.8 ]
[0.05 0.95]]
[[0.15 0.85]
[0.25 0.75]
[0.01 0.99]]]
The first layer of y_pred is:
y_pred[:, :, 0]
array([[0.1 , 0.4 , 0.55],
[0.3 , 0.2 , 0.05],
[0.15, 0.25, 0.01]])
And the second is:
y_pred[:, :, 1]
array([[0.9 , 0.6 , 0.45],
[0.7 , 0.8 , 0.95],
[0.85, 0.75, 0.99]])
The true labels could be the following:
y_true = np.array([[0, 1, 1], [1, 1, 1], [1, 0, 1]])
print(y_true)
[[0 1 1]
[1 1 1]
[1 0 1]]
The entries in y_true are, in fact, the rule values I mentioned before and relate to the 2D entries in y_pred. For example: [0, 1, 1] \rightarrow [[0.1, 0.9], [0.4, 0.6], [0.55, 0.45]], meaning that 0 \rightarrow [0.1, 0.9], 1 \rightarrow [0.4, 0.6], 1 \rightarrow [0.55, 0.45]. The 0 means we take 0.1, the first 1 means we take 0.6, and the second 1 means we take 0.45. In other words, the algorithm takes one entry per location, and that entry comes from either the first or the second layer of y_pred. The rule is dictated by the entries in y_true.
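We can reproduce this selection by hand with NumPy (a small sketch for checking the idea; np.take_along_axis is just one way to pick, at every location, the channel named by y_true - it is not necessarily what Keras does internally):
# At each (row, col), keep the channel of y_pred indicated by y_true.
selected = np.take_along_axis(y_pred, y_true[..., np.newaxis], axis=-1).squeeze(-1)
print(selected)
[[0.1  0.6  0.45]
 [0.7  0.8  0.95]
 [0.85 0.25 0.99]]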
We will use tf.keras.losses.SparseCategoricalCrossentropy. If we evaluate it with reduction set to NONE, we get the values of the logarithms in the sum.
import tensorflow as tf

scce_none = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE)
scce_none(y_true,y_pred).numpy()
array([[2.30258512, 0.51082563, 0.79850769],
[0.35667494, 0.22314355, 0.05129329],
[0.16251893, 1.38629436, 0.01005034]])
With a trick using exponentiation, we can roll back and see what entries of the matrix are used.
np.exp(-scce_none(y_true,y_pred).numpy())
array([[0.1 , 0.59999999, 0.45 ],
[0.7 , 0.8 , 0.95 ],
[0.85 , 0.25 , 0.99 ]])
To get the final loss value, we can use tf.keras.losses.SparseCategoricalCrossentropy() with the default options, or compute it directly:
1/9 * np.sum(scce_none(y_true,y_pred).numpy()) # since hw = 9
0.644654873965515
The same result is obtained using tf.keras.losses.SparseCategoricalCrossentropy(from_logits = False):
scce_tot = tf.keras.losses.SparseCategoricalCrossentropy(from_logits = False)
scce_tot(y_true,y_pred).numpy()
0.6446548700332642
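The same number also comes out of the manual selection sketched above: averaging the negative logarithms of the selected entries reproduces the loss (any tiny difference in the last decimals comes from Keras working in float32).
np.mean(-np.log(selected)) # ≈ 0.6447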
How is the training done?
Now, let's analyze how the training is done. We go back to the actual prediction matrix of unet, which is a None x 96x128x23 matrix. The mask matrix is None x 96x128x1. None stands for the number of examples. Let's take a single example: 96x128x23 are the dimensions of y_pred and 96x128x1 are the dimensions of y_true. The entries in y_true are integers from 0 to 22 (23 in total). Let's analyze the location (1, 2) across all 23 layers in y_pred. If at (1, 2) we have 5 in y_true, then the cost function will evaluate the logarithm of the entry (1, 2) on the 6th layer; if at (40, 67) we have 0 in y_true, then the cost function will evaluate the logarithm of the entry (40, 67) on the 1st layer, etc. The first 96x128x23 values in y_pred (1st forward propagation) have no "connection" with y_true (because the cost function is calculated after y_pred is evaluated).

This selection reduces y_pred to a 96x128 matrix of probabilities, where each probability is related to an integer in the range 0…22. Then we evaluate the cost function and get the error. Based on that, backpropagation will try to optimize the parameters of the network so that the error becomes smaller. For the error to become smaller, the entries selected from the 23 layers of y_pred have to tend to 1 (since they are probabilities): in -\frac{1}{hw}\sum_{h, w} \log \left(p_{h, w, c_{h, w}}\right) the logarithms of numbers between 0 and 1 are non-positive, so the loss is non-negative (thanks to the "-" sign) and tends to zero as the selected probabilities tend to 1. This, I think, is how the training is done.
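To see the loss inside an actual training loop, here is a minimal, self-contained sketch. The toy model below is only a stand-in for unet (a single 1x1 convolution with a softmax over the 23 channels), and the random images and masks are placeholders, not the assignment's data:
# Toy stand-in for unet: maps each pixel of a 96x128x3 image to 23 class probabilities.
toy_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(23, 1, activation='softmax', input_shape=(96, 128, 3))
])
toy_model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False))

# Placeholder data: 4 images and 4 masks with integer labels 0..22 (shape 4x96x128x1).
images = np.random.rand(4, 96, 128, 3).astype('float32')
masks = np.random.randint(0, 23, size=(4, 96, 128, 1))

toy_model.fit(images, masks, epochs=1)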
How is the prediction done?
We feed an image to the network and get a y_pred 96x128x23 matrix with probability entries to evaluate the model. Then, we prepare a matrix by taking, at each pixel, the index of the maximal element along the 23 channels. If we evaluated the loss with this matrix as the label (which we don't do, since we are in the prediction phase), it would be the label choice that minimizes it, because it picks the largest probability at every pixel. This new matrix is the mask for the image.
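A minimal sketch of that last step, assuming y_pred is a single 96x128x23 probability array: tf.argmax along the channel axis returns, for each pixel, the index of the maximal element, i.e. the predicted class.
mask = tf.argmax(y_pred, axis=-1)  # shape (96, 128); entry (i, j) is a class index 0..22
mask = mask[..., tf.newaxis]       # shape (96, 128, 1), matching the mask format above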
I hope my ideas are correct.
Henrikh