As Wendy says, these are all interesting questions. I went back and listened to all the lectures on CycleGAN, since it had been maybe 3 or 4 years since I originally watched them. There are several levels to the answer, at least as I understand it. For starters, CycleGAN is a much more complicated situation than the “simple” GANs we’ve seen up to this point: there are two GANs (4 models in total) working in concert in a pretty complex way, and then you’ve got additional requirements like wanting the full round trip to reproduce essentially the same image you started with. So you end up with a total loss function with 4 different terms to incorporate all the goals we have for the system. And some of those goals, like the part expressed by the Identity Loss, are pretty clearly more “distance” flavored (regression style) as opposed to classification style. It’s not just “is it a horse or not”, it’s “does it look exactly like the picture of the horse that we started with”.
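To make the “4 terms” concrete, here’s a rough sketch of how I understand them fitting together on the generator side. The names (`gen_XY`, `disc_Y`, …) and the weights are just illustrative, not the exact notebook code:

```python
import torch
import torch.nn.functional as F

def cyclegan_generator_loss(real_X, real_Y, gen_XY, gen_YX, disc_Y, disc_X,
                            lambda_cycle=10.0, lambda_identity=5.0):
    """Sketch of the four-term CycleGAN generator loss.
    gen_XY maps domain X -> Y (e.g. horse -> zebra), gen_YX maps Y -> X.
    The adversarial terms are classification-flavored; the cycle and
    identity terms are plain pixel distances (regression-flavored)."""
    fake_Y = gen_XY(real_X)
    fake_X = gen_YX(real_Y)
    pred_fake_Y = disc_Y(fake_Y)
    pred_fake_X = disc_X(fake_X)

    # Terms 1 & 2: adversarial -- "does the discriminator think this looks real?"
    adv_XY = F.mse_loss(pred_fake_Y, torch.ones_like(pred_fake_Y))
    adv_YX = F.mse_loss(pred_fake_X, torch.ones_like(pred_fake_X))

    # Term 3: cycle consistency -- going all the way around the loop
    # should reproduce (almost) exactly the image we started with
    cycle = F.l1_loss(gen_YX(fake_Y), real_X) + F.l1_loss(gen_XY(fake_X), real_Y)

    # Term 4: identity -- feeding a generator an image already in its
    # target domain should leave it (almost) unchanged
    identity = F.l1_loss(gen_XY(real_Y), real_Y) + F.l1_loss(gen_YX(real_X), real_X)

    return adv_XY + adv_YX + lambda_cycle * cycle + lambda_identity * identity

# Trivial stand-ins just to show the shapes; the real models are conv nets.
identity_gen = lambda img: img                 # pretend "generator"
mean_score = lambda img: img.mean(dim=(1, 2, 3))  # pretend "discriminator"
x, y = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(cyclegan_generator_loss(x, y, identity_gen, identity_gen, mean_score, mean_score))
```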
But in the specific lecture about MSE, my interpretation of what Prof Zhou is saying is that this is an instance of the “meta” principle that in ML there is never one single recipe that works best in all cases. All the ML courses I’ve taken start with networks that are classifiers, e.g. the famous “is there a cat in this image” case that we saw in DLS Course 1. And in all those cases, we always use sigmoid or softmax for the output activation and BCE loss as the cost function. Note that the sigmoid + BCE combination is convex only in the simplest case of Logistic Regression, which Prof Ng says we can consider as a trivial Neural Network with only one layer. As soon as you introduce a second layer to form a real fully connected NN, the sigmoid + BCE combination is no longer convex. But if you try MSE instead, things get worse even in the simple case: sigmoid + MSE is non-convex for plain Logistic Regression, and the loss surface has flat regions where a confidently wrong prediction produces almost no gradient. Here’s a thread with a simple example of a 2D logistic regression showing the comparison between BCE loss and MSE. So the other courses just drop the subject at that point and never talk about MSE for classifiers again.
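Here’s a tiny standalone example (mine, not from the lectures) of the “almost no gradient” problem: for a positive example that the model currently gets badly wrong, BCE still pushes hard, while MSE through the sigmoid barely moves the logit at all.

```python
import torch

# One scalar logit for a positive example (y = 1) that the model currently
# gets badly wrong: a large negative logit, so sigmoid(z) is near 0.
z = torch.tensor([-8.0], requires_grad=True)
y = torch.tensor([1.0])

bce = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
bce.backward()
print("BCE d(loss)/dz:", z.grad.item())   # roughly -1: a strong corrective push

z.grad = None
mse = torch.nn.functional.mse_loss(torch.sigmoid(z), y)
mse.backward()
print("MSE d(loss)/dz:", z.grad.item())   # about -0.0007: the sigmoid's derivative
                                          # kills the gradient exactly when the
                                          # prediction is most wrong
```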
Of course we’ve seen other examples of how we can get GANs to work better with different loss functions, e.g. Wasserstein Loss in GANs C1 W3. So even though the discriminator or critic in a GAN is doing something like a classification, it’s more subtle than “is there a horse in this picture”. Deciding whether an image looks real or fake can apparently benefit from different cost functions, so that’s yet another degree of freedom in our hyperparameter search when we are building such a system. Sigh.
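Just to line the alternatives up side by side, here’s a sketch of the three discriminator/critic objectives we’ve seen so far. The function names are mine; the least-squares (MSE) version is the flavor CycleGAN uses for its adversarial terms.

```python
import torch
import torch.nn.functional as F

def disc_loss_bce(pred_real, pred_fake):
    # Classic GAN discriminator: binary classification with BCE on the logits.
    return (F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real)) +
            F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake)))

def disc_loss_lsgan(pred_real, pred_fake):
    # Least-squares GAN (the CycleGAN choice): same job, but framed as an
    # MSE "distance to the label" instead of a classification.
    return (F.mse_loss(pred_real, torch.ones_like(pred_real)) +
            F.mse_loss(pred_fake, torch.zeros_like(pred_fake)))

def critic_loss_wasserstein(pred_real, pred_fake):
    # WGAN critic (GANs C1 W3): no labels at all, just push real scores up and
    # fake scores down (needs weight clipping or a gradient penalty on top).
    return pred_fake.mean() - pred_real.mean()

pred_real = torch.randn(8, 1) + 2.0   # pretend scores on a batch of real images
pred_fake = torch.randn(8, 1) - 2.0   # pretend scores on a batch of fakes
print(disc_loss_bce(pred_real, pred_fake),
      disc_loss_lsgan(pred_real, pred_fake),
      critic_loss_wasserstein(pred_real, pred_fake))
```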
I guess we could complete the loop here and go back and try using MSE for our “is there a cat in this picture” models in the style of DLS C1 and see what happens. Science!
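If anyone wants to actually run that experiment, here’s a rough harness to start from (synthetic data standing in for the cat images, network shape arbitrary), where the only thing that changes between runs is the loss function:

```python
import torch
import torch.nn as nn

# Toy stand-in for the DLS C1 "is there a cat" setup: small net, sigmoid output.
torch.manual_seed(0)
X = torch.randn(256, 64)
y = (X[:, 0] > 0).float().unsqueeze(1)   # an arbitrary "cat / no cat" rule

def train(loss_fn, epochs=200):
    model = nn.Sequential(nn.Linear(64, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
    opt = torch.optim.SGD(model.parameters(), lr=0.5)
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return ((model(X) > 0.5).float() == y).float().mean().item()  # training accuracy

print("BCE accuracy:", train(nn.BCELoss()))
print("MSE accuracy:", train(nn.MSELoss()))
```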