Hi, I have some questions about “Least Squares Loss”:
1.) Isn’t that just the standard loss function for linear regression, except that the label here is always 1 or 0?
2.) What exactly is different in the CycleGAN context compared to any other logistic regression?
As I understand it, BCE loss is required for logistic regression because the squared loss function is not convex when combined with sigmoid activations (many local minima). The sigmoid, in turn, is needed to obtain a 0/1 statement from the model.
In the previous exercises, the BCE loss from PyTorch was always used, with the sigmoid function included.
Only in the CycleGAN assignment does the loss function change to MSE, and now an arbitrary numerical value is compared with the 0/1 label. Why does this work here?
3.) If it works: why not do the same in all classification tasks? Why is the BCE loss function used at all if MSE works here and is even superior to BCE in terms of vanishing gradients?
It seems I’m missing something very obvious here…
Thanks in advance!
P.S.: This is a great course!
Good questions, @Bernhard_Wieczorek! I think the reason that Least Squares Loss was used in this assignment was because that’s what the official CycleGAN implementation used, and that the CycleGAN authors used it because that’s what they found worked best. Similarly, BCE loss has traditionally been used for regular GANs, which is why this course uses it when working with those GANs.
But, your point is good - if least squares loss helps avoid vanishing gradients for CycleGAN, wouldn’t it also help with regular GANs? It looks like the authors of this paper had the same idea, and did some experimentation to show the advantages of using least squares loss for regular GANs. You’re thinking like a true GANs expert!
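To make the comparison concrete, here is a minimal sketch of the two discriminator losses side by side. The score tensors and batch size are invented for illustration and aren't taken from any assignment:

```python
import torch
import torch.nn as nn

# Hypothetical discriminator outputs (raw, unbounded scores) for a batch of
# real and of generated images; in an actual GAN these would come from
# something like disc(real) and disc(gen(z)).
real_scores = torch.randn(16, 1)
fake_scores = torch.randn(16, 1)

# Classic GAN discriminator loss: BCE (with the sigmoid built in) against
# a target of 1 for real images and 0 for fakes.
bce = nn.BCEWithLogitsLoss()
bce_disc_loss = bce(real_scores, torch.ones_like(real_scores)) + \
                bce(fake_scores, torch.zeros_like(fake_scores))

# Least-squares (LSGAN / CycleGAN style) discriminator loss: plain MSE
# against the same 1/0 targets, applied directly to the unbounded scores.
mse = nn.MSELoss()
ls_disc_loss = mse(real_scores, torch.ones_like(real_scores)) + \
               mse(fake_scores, torch.zeros_like(fake_scores))

print(bce_disc_loss.item(), ls_disc_loss.item())
```

The only structural difference is the criterion; the targets are the same 1/0 labels in both cases.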
As Wendy says, these are all interesting questions. I went back and listened to all the lectures on CycleGAN, since it had been maybe 3 or 4 years since I originally watched them. There are several levels to the answer, at least as I understand it. For starters, CycleGAN is a way more complicated situation than the “simple” GANs that we’ve seen up to this point: there are two GANs (4 total models) working in concert in a pretty complex way, and then you’ve got additional requirements like wanting the full loop to reproduce essentially the same image you started with. So you end up with a complex total loss function with 4 different terms to incorporate all the goals that we have for the system. And some of those goals, like the part expressed by the Identity Loss, are pretty clearly more “distance” flavored (regression style) as opposed to classification style. It’s not just “is it a horse or not”, it’s “does it look exactly like the picture of the horse that we started with”.
But in the specific lecture about MSE my interpretation of what Prof Zhou is saying is that this is an instance of the “meta” principle that in ML there is never one single recipe that always works best in all cases. All the ML courses that I’ve taken usually start with networks that are classifiers, e.g. the famous “is there a cat in this image” case that we saw in DLS Course 1. And in all those cases, we always use sigmoid or softmax for the output activation and BCE loss as the cost function. Note that the sigmoid + BCE combination is convex only in the simplest case of Logistic Regression, which Prof Ng says we can consider as a trivial Neural Network with only one layer. As soon as you introduce a second layer to form a real fully connected NN, then the sigmoid + BCE combination is no longer convex. But if you try MSE, the solution surfaces are completely crazy. Here’s a thread with a simple example of a 2D logistic regression showing the comparison between BCE loss and MSE. So the other courses just drop the subject at that point and never talk about MSE for classifiers again.
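If you want to see that numerically, here is a tiny sketch in the spirit of that thread: a one-parameter logistic regression whose toy data and weight sweep are invented purely for illustration:

```python
import torch
import torch.nn.functional as F

# Toy 1-parameter logistic regression: prediction = sigmoid(w * x).
# Sweep w over a range and compare the shapes of the BCE and MSE losses.
x = torch.tensor([1.0, 2.0, -1.5, 0.5])
y = torch.tensor([1.0, 1.0, 0.0, 0.0])

ws = torch.linspace(-10.0, 10.0, steps=201)
bce_curve = torch.stack([F.binary_cross_entropy(torch.sigmoid(w * x), y) for w in ws])
mse_curve = torch.stack([F.mse_loss(torch.sigmoid(w * x), y) for w in ws])

# The BCE curve keeps growing as w moves in the wrong direction (it stays
# convex in w), while the MSE curve flattens wherever the sigmoid saturates,
# which is exactly where gradients vanish and convexity is lost.
print(bce_curve.min().item(), bce_curve.max().item())
print(mse_curve.min().item(), mse_curve.max().item())
```

Plotting bce_curve and mse_curve against ws makes the difference in shape obvious.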
Of course we’ve seen other examples of how we can get GANs to work better with different loss functions, e.g. Wasserstein Loss in GANs C1 W3. So even though the discriminator or critic in a GAN is doing something that is like a classification, it’s more subtle than “is there a horse in this picture”. Deciding whether an image looks real or fake can apparently benefit from different cost functions, so that’s yet another degree of freedom in our hyperparameter search when we are building such a system. Sigh.
I guess we could complete the loop here and go back and try using MSE for our “is there a cat in this picture” models in the style of DLS C1 and see what happens. Science!
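For what it's worth, a minimal sketch of what such an experiment could look like; the tiny model, the 64x64 input size and the random batch are placeholders, not the actual DLS C1 setup:

```python
import torch
import torch.nn as nn

# Placeholder "is there a cat in this picture" classifier; the architecture,
# input size and random data below are made up for illustration only.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(64 * 64 * 3, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)
x = torch.randn(8, 3, 64, 64)               # fake image batch
y = torch.randint(0, 2, (8, 1)).float()     # fake 0/1 labels

logits = model(x)

# Usual setup: BCE on the logits (sigmoid built into the loss).
bce_loss = nn.BCEWithLogitsLoss()(logits, y)

# Experimental setup: MSE on the sigmoid outputs against the same 0/1 labels.
mse_loss = nn.MSELoss()(torch.sigmoid(logits), y)

print(bce_loss.item(), mse_loss.item())
```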
Hi Paul, thank you very much for your detailed answer!
Unfortunately, I still have some difficulty understanding and would therefore like to ask you:
It seems totally understandable to me that the mapping of an input image into the latent space by the generator (e.g. to impose a style on the content) can be seen as a regression task: “Find that specific point.”
However, this point is not explicitly compared in the loss function - we do not know it and therefore cannot use it as a label in a regression loss function.
The check as to whether this is a suitable point is only performed indirectly via the discriminator, which in my view still only performs a real/fake classification, right?
As I understand it, Wasserstein Loss makes sense as soon as I no longer make a judgement about a single image but about a distribution: at that moment I can compare the distribution of the generated images with that of the real ones. The result is then no longer binary but a degree of congruence, i.e. a kind of regression.
Could this be the reason why all GANs, at least, have to be treated somewhat differently from standard classifications? After all, with a standard classification I have no scatter, e.g. an image of a 9 is always a 9, so I can only evaluate the “fidelity”, whereas with GANs the “diversity” is added and I can therefore check two criteria?
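For reference, the W-loss from C1 W3 boils down to comparing mean critic scores rather than per-image 0/1 targets. A minimal sketch, with made-up score tensors and the gradient penalty term left out:

```python
import torch

# Hypothetical critic scores on a batch of real and of generated images
# (unbounded real numbers, not probabilities).
crit_real_scores = torch.randn(16, 1)
crit_fake_scores = torch.randn(16, 1)

# Critic loss: widen the gap between the mean score on real images and the
# mean score on fakes (written as a quantity to minimize). The gradient
# penalty term used in the assignment is omitted here.
critic_loss = crit_fake_scores.mean() - crit_real_scores.mean()

# Generator loss: push the critic's mean score on fakes up.
gen_loss = -crit_fake_scores.mean()

print(critic_loss.item(), gen_loss.item())
```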
One more question:
If I have understood correctly, the CycleGAN loss function consists of the adversarial loss and a cycle consistency loss.
However, MSE is used in the adversarial loss (AL); in the assignment we define adv_criterion = nn.MSELoss(). The identity loss (IL) you mentioned and the cycle consistency loss (CCL) are additional parts.
This increases the size of the orchestra of loss functions, but the IL and CCL cannot have a direct influence on the AL, can they? I mean in the sense that the results of the IL/CCL flow directly into the AL and thus create new evaluation criteria there, similar to how considering distributions does in W-Loss? Rather, they are separate parts that each work on their own and all have to be satisfied or optimized by the model. In other words, the model tries to find a compromise that satisfies all of them, but each loss function acts and evaluates in isolation.
Thank you also for your clarification on BCE / MSE in Standard Classifications!
Yes, I had meant that MSE simply makes it much worse by creating even more minima. But your picture in the other thread is very illustrative - thanks for that.
Please allow me to ask two more questions here too; I hope it’s not too off-topic in the GAN context.
I don’t understand your statement in the first post of this thread. Sure, the distance between 0.49 and 0.51 is the same as between 0.89 and 0.91. But wouldn’t we compare both 0.49 and 0.89 with 1.0 (True) if the label is True, and wouldn’t a change from 0.49 to 0.51 bring the same percentage improvement as 0.89 to 0.91? Isn’t the problem rather the creation of extra minima when I use the sigmoid, 1/(1+e^-z), inside a quadratic loss function?
Another general question about local minima: I understood Prof Ng in the DeepLearning course to mean that as the number of parameters increases, it becomes less and less likely that there will be local minima, because the chance increases that one of the many dimensions will still have a negative slope. Prof Ng spoke of plateaus where the gradients are low and it takes a long time to leave the saddle point. A small DNN with 3 layers of 16, 16, 10 units with the classic MNIST 28x28 input already has approx. 13,000 parameters. That already feels like enough dimensions to me. Would you agree with this or do we still have to assume local minima here?
I hope it’s ok to ask you this, but I can’t find any sources that deal with this: Prof Ng’s statement makes perfect sense to me, yet everyone is always talking about local minima…
Many, many thanks in advance, Paul - I hope it’s ok to write a huge post here and ask a thousand things!
Best regards,
Bernhard
There are a lot of questions there. Maybe I will need to take a “divide and conquer” approach rather than create one huge answer. So let me parse things into subtopics.
I don’t remember anywhere in DLS where Prof Ng says anything that could be interpreted that way. The more parameters you have, the more complex your solution surfaces are and the more local minima you will have. There’s never any real hope of finding a solution that is anything other than a local minimum. In fact, finding the absolute minimum would probably represent extreme overfitting in any case. But it has been shown that for sufficiently complex problems, there is a band of local minima that gradient descent is very likely to find and that are actually reasonable solutions. So what he does say is that in real solutions the “local minimum” issue turns out not to be that big a deal.
Here’s a thread that talks about the work from Yann LeCun’s group discussing the math showing that local minima are not really a problem; it also links to a thread that deals with the huge number of local minima created by weight space symmetry.
Yes, this is a good point. Sorry, my example is not really that relevant.
Yes, that description sounds right to me: the loss terms act independently and the solution is a compromise. Of course any change in the weights potentially affects all of them. Note that we also have some hyperparameter tuning work to do in selecting the appropriate values for the weighting factors \lambda_n for the various loss terms.
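To make that concrete, here is a rough sketch of how the generator's loss terms might be combined. The tensor shapes, the choice of L1 for the reconstruction-style terms, and the \lambda values are illustrative assumptions, not the assignment's exact code:

```python
import torch
import torch.nn as nn

adv_criterion = nn.MSELoss()    # least-squares adversarial loss, as in the assignment
recon_criterion = nn.L1Loss()   # assumed here for the cycle-consistency and identity terms

# Hypothetical tensors standing in for the assignment's intermediate results:
# disc_fake_scores: discriminator scores on generated images,
# real / cycled / identity: the corresponding image batches.
disc_fake_scores = torch.randn(4, 1, 30, 30)
real = torch.randn(4, 3, 256, 256)
cycled = torch.randn(4, 3, 256, 256)
identity = torch.randn(4, 3, 256, 256)

# Each term is computed in isolation; they only interact through the
# weighted sum that the optimizer actually minimizes.
adversarial_loss = adv_criterion(disc_fake_scores, torch.ones_like(disc_fake_scores))
cycle_loss = recon_criterion(cycled, real)
identity_loss = recon_criterion(identity, real)

lambda_cycle, lambda_identity = 10.0, 0.1   # illustrative weighting factors
gen_loss = adversarial_loss + lambda_cycle * cycle_loss + lambda_identity * identity_loss
```

So yes: the IL and CCL never feed into the AL computation itself; they only interact with it through the weighted sum that the optimizer sees.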
I realize there are lots more questions that I haven’t gotten to yet. It may be another day before I can muster the time to respond in detail on those.
Hi Paul, thanks for your time and the link to Yann LeCun’s paper. I’ll need some time to read it carefully, I’m looking forward to it!
Well, then I obviously misunderstood Prof Ng. Can you please help me figure out where my thinking goes wrong?
I am referring to this lesson in the Deep Learning Specialization, Course 2, Week 2:
I would like to quote Prof Ng from it:
"It turns out that if you are plotting a figure like this in two dimensions, then it’s easy to create plots like this with a lot of different local optima.
And these very low dimensional plots used to guide their intuition.
But this intuition isn’t actually correct.
It turns out if you create a neural network, most points of zero gradients are not local optima like points like this.
Instead most points of zero gradient in a cost function are saddle points."
Further:
“… if you are in, say, a 20,000 dimensional space, then for it to be a local optima, all 20,000 directions need to look like this.
And so the chance of that happening is maybe very small, maybe 2^-20,000. …
So that’s why in very high-dimensional spaces you’re actually much more likely to run into a saddle point … than a local optimum.”
Prof Ng also says that plateaus are a problem because they slow down training; possibly I then end up with a de facto minimum because I stop training at some point?
However, he also references methodologies such as Adam as a remedy:
“So the takeaways from this video are, first, you’re actually pretty unlikely to get stuck in bad local optima so long as you’re training a reasonably large neural network, save a lot of parameters, and the cost function J is defined over a relatively high dimensional space.
But second, that plateaus are a problem and you can actually make learning pretty slow. And this is where algorithms like momentum or RmsProp or Adam can really help your learning algorithm as well.”
In particular, your statement that a local optimum may not even be desirable because it means overfitting is something I hadn’t thought about in that way, thanks for the food for thought!
I know that I can avoid overfitting with regularization, dropout, batch norm and a sufficiently large training set, but what influence do reg and dropout in particular have on the loss surface? My assumption:
From my point of view, regularization permanently interferes with the flexibility of the gradient descent algorithm, since the parameters are constrained and thus “paths” on the loss surface are blocked, right? So do these kinds of “blockages” possibly turn saddle points back into local minima, because I am manually limiting the degrees of freedom of the high-dimensional space?
Dropout does not affect the loss surface according to my intuition, because I only suppress inputs in the forward prop?
That’s not what I said. What I said was that finding the actual overall global minimum is not desirable. That is not the same thing as a local minimum. People frequently assume that our goal is to find the actual global minimum of the cost, but that’s not actually true for the reason I mentioned.
That is a different statement that does not conflict with what I said. Just because you have even more saddle points in a higher dimensional space does not mean you have fewer local minima or local maxima. You may well have more of all of them. Perhaps the ratio of saddle points to local optima may be higher, but the absolute numbers of all of them can be higher. But I think we’re into “angels dancing on the head of a pin” territory here. The bottom line is what Prof Ng says here:
Mind you, I have not actually read the Yann LeCun paper beyond the abstract and the first page or thereabouts, so I don’t know the math presented therein, but I’m assuming that Prof Ng’s statement above is congruent with (and very likely based on) what is in the paper.
BTW I found that paper through some link or reference that Prof Ng gave, although I can’t actually remember where that was. Very likely it was in some of the notes here in DLS Course 2, but it might have been from one of his YouTube videos as well.
Sorry, my typo: I actually meant global minimum.
I always thought that if you have enough good data that covers the later use case well (and possibly use some regularization methods), the best possible fit to that data is what you want.
I have always seen this as the global minimum.
As I said, your input is food for thought for me, something I will have to think about for a while…
I must have completely misunderstood Prof Ng here.
I interpreted the statement to mean that local minima turn into saddle points when the number of dimensions increases, so no more local minima, only saddle points (of course, with a limited number of parameters it’s never none, only fewer).
I don’t want to steal your time with irrelevant discussions! Obviously I have a misunderstanding.
I will read the paper…
Hi Paul,
I have read the paper, very informative, thanks for the link!
Even if I don’t fully understand the mathematical proofs, there is a clear insight:
There is a difference between “good” and “bad” local minima (good = small remaining error, bad = high remaining error of the loss function).
I did notice that Prof Ng spoke of “bad” local minima that become saddle points in the quoted video.
However, I did not realise in the video that there are also “good” local minima to which this does not apply.
From this, I erroneously oversimplified that all local minima become saddle points.
The paper also explains very revealingly that remaining local minima are arranged in a band of similar remaining errors and thus lead to comparable performance of the neural network, regardless of which specific minimum was reached in training - a behavior that I have already observed even with small experiments.
I am glad that I have now been able to correct this misunderstanding.
Thank you very much for your help!
To bring this discussion back into the GAN context, and because this question is still bothering me, may I ask you to share your opinion on this as well?
I’m very glad to hear that you read the whole LeCun paper and that it was informative. But I think you’re still misinterpreting what Prof Ng said. There is no “becoming” a saddle point. With a given network architecture and a given set of training data, the network “is what it is”. We are traversing the cost surface by following the gradients. If you have a point with zero gradient, then it’s either a saddle point, a local minimum or a local maximum. Well, it could also be the global minimum, but the global minimum is also a local minimum. As I think I commented earlier, every square is a rectangle, but not every rectangle is a square. Also note that the global minimum is not unique, because of weight space symmetry.
I will hope to get back to the GANs specific discussion in a few hours, but have other things to attend to right now.
Yes, of course: A model is as it is, and so is the associated loss function. However, if I change the model, e.g. make it larger, then the associated loss function changes.
By “turn into”, I don’t mean that a specific minimum becomes a saddle point, but that a sufficiently large model no longer has any bad minima: above the residual-error band in which the “good” minima lie, only saddle points occur.
I think you’re continuing your trend of just reading too much into things. Does the LeCun paper actually say what you just said? I would bet you all the beer you can drink in a single sitting that it does not say that. My understanding is that there is never a case in which you don’t have any bad minima and there is never any guarantee you’ll get a “good” minimum value on your first try. It’s just that once the model is sufficiently complex, the probabilities are more in your favor of being able to find a sufficiently good minimum after a reasonable number of tries.
I recommend you discontinue worrying about the small details of local minima and saddle points. It’s not significantly important.
What’s important is that you understand how to determine if the minimum you have found gives “good enough” performance for the model you’re trying to create.
Yes, you’re both right: there are no guarantees, only probabilities and the details are just details.
I’m not here to argue, I’m here to learn - I have. Thanks for the feedback.