Are we forced to use binary_crossentropy?

Hey there,

I just noticed that the assignment contains this statement:

compute the reconstruction loss (hint: use the mse_loss defined above instead of the bce_loss in the ungraded lab, then multiply by the flattened dimensions of the image, i.e. 64 x 64 x 3)

This seems like a good hint, and it makes sense to me, because the autoencoder setup is not a pure classification or categorization problem (correct me if I'm wrong), so using binary_crossentropy or categorical_crossentropy feels inconsistent. :laughing:

So I went back to the ungraded lab and found that it uses binary_crossentropy, which expects probabilities according to the library source code.

Now I'm confused, because when I change the loss to mean_squared_error, the lab still works fine, and I suspect it would even with binary_crossentropy(from_logits=True) (I haven't tried that yet).

My question is whether binary_crossentropy is the only suitable loss here, or whether I'm missing something.

Or is it just because the final output layer of the decoder in the ungraded lab is

tf.keras.layers.Conv2DTranspose(filters=1, kernel_size=3, strides=1, padding='same', activation='sigmoid', name="decode_final")(x)

which means the output shape is Width x Height x 1, i.e. a single channel?

Or is there some rule that with more than one output unit we should use mean_squared_error, and otherwise binary_crossentropy?
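To make the question concrete, here is a minimal NumPy sketch (my own toy example, not the assignment's code) showing that both losses are computable on a 64 x 64 x 3 reconstruction when the values are in [0, 1], and what the "multiply by the flattened dimensions" hint does: it turns the per-pixel mean into a sum, so the reconstruction term is on a comparable scale to the KL term.

```python
import numpy as np

# Toy example: a fake "image" and a noisy "reconstruction", both in [0, 1].
rng = np.random.default_rng(0)
h, w, c = 64, 64, 3
x_true = rng.random((h, w, c))
x_recon = np.clip(x_true + 0.05 * rng.standard_normal((h, w, c)), 1e-7, 1 - 1e-7)

# Mean squared error, averaged over all pixels.
mse_mean = np.mean((x_true - x_recon) ** 2)

# Multiplying the mean by the flattened dimensions gives the summed error,
# which is what the assignment hint is asking for.
mse_sum = mse_mean * (h * w * c)

# Per-pixel binary cross-entropy also "works" here, but only because both
# tensors live in [0, 1], so each pixel can be treated as a Bernoulli probability.
bce_mean = -np.mean(x_true * np.log(x_recon) + (1 - x_true) * np.log(1 - x_recon))

print(mse_mean, mse_sum, bce_mean)
```

So neither loss errors out on sigmoid-range outputs; the difference is in what probabilistic assumption each one encodes, not in the number of output units.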


Binary cross-entropy is used for classification problems, while mean squared error is for regression problems. Although MSE can still be used in a classification problem, it is not a recommended loss there because it becomes nonconvex in the binary classification case (when composed with a sigmoid output).
See "Why Using Mean Squared Error (MSE) Cost Function for Binary Classification is a Bad Idea?" by Rafay Khan on Towards Data Science.
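A quick way to see the practical consequence of that nonconvexity (my own toy numbers, not from the article): with a sigmoid output p = sigmoid(z) and true label y, the gradient of MSE with respect to the logit z is 2*(p - y)*p*(1 - p), while for BCE it is simply (p - y). For a confidently wrong prediction, the extra p*(1 - p) factor crushes the MSE gradient to nearly zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = -10.0        # logit of a confidently wrong prediction
y = 1.0          # true label
p = sigmoid(z)   # ~ 4.5e-5

# Gradient of MSE w.r.t. the logit: tiny, so learning stalls on a plateau.
grad_mse = 2 * (p - y) * p * (1 - p)

# Gradient of BCE w.r.t. the logit: close to -1, a strong corrective signal.
grad_bce = p - y

print(grad_mse, grad_bce)
```

This is the same plateau effect described later in the thread when MSE is tried on the MNIST VAE.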


Thanks, @jackliu333, but that makes my question even more confusing: is an autoencoder doing classification, or pure regression like bounding-box prediction? :woozy_face:

An autoencoder is a general architecture. Depending on the type of output at the final layer, it can support both classification and regression.

I found some relevant discussions of this under these links:

According to the math that is also discussed in the arXiv paper linked in class, if we use a Gaussian prior for the latent representation, we should use the MSE loss. However, the MNIST VAE showcase works well with the BCE loss, and to me it seems that learning gets stuck on a plateau when I try the MSE loss.
From what I read, I gather that the BCE loss works well in this case because the input distribution is close to Bernoulli, i.e., with good approximation, there are almost only black and white pixels (0's and 1's). But then I'm not sure why we use a Gaussian prior for this exercise in the first place.
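The likelihood view above can be sketched numerically (a toy example with made-up pixel values, not the MNIST data): the negative log-likelihood of a Bernoulli decoder is exactly the per-pixel binary cross-entropy, while the negative log-likelihood of a Gaussian decoder with fixed unit variance reduces (up to an additive constant) to half the squared error, i.e. a scaled MSE. That is why the choice between BCE and MSE amounts to an assumption about the output distribution, not about the latent prior.

```python
import numpy as np

# Near-binary "pixels" (like MNIST) and a decoder output in (0, 1).
x = np.array([0.0, 1.0, 0.9, 0.1])
p = np.array([0.1, 0.8, 0.7, 0.2])

# Bernoulli negative log-likelihood == binary cross-entropy, summed per pixel.
nll_bernoulli = -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Gaussian NLL with sigma = 1, dropping the log(2*pi) constant:
# 0.5 * squared error, i.e. MSE up to a constant scale.
nll_gaussian = 0.5 * np.sum((x - p) ** 2)

print(nll_bernoulli, nll_gaussian)
```

Note that the Gaussian *prior* applies to the latent code z, while BCE vs MSE is about the decoder's output distribution, so using a Gaussian prior together with a Bernoulli (BCE) decoder is not contradictory.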