Problem with BCE loss video question

Hey, everyone! I hope that you are all safe and sound.

I have a question regarding the cost-function chart of the BCE loss in the video “Problem with BCE loss”.

According to Math Stack Exchange, the derivative of the BCE loss is:

[image: derivative of the BCE loss with respect to the prediction]

Clearly, with a smaller distance between the fake and real distributions, the BCE loss skyrockets to infinity because of the log terms, while with a larger distance our cost tends to zero.

I am confused about two things:

  1. I don’t understand why we have zero gradients at the start and the end of the cost-function chart.
  2. I don’t understand why we have a higher loss value when we have a bigger distance between the fake and real distributions in the following figure.

2 Likes

As you can see, the loss function is flat on both sides. If we take the derivative at these points, the signal will be very weak (the slope of the tangent is close to 0). This is called the vanishing gradient problem.

As for the second question, this is how I understand the problem. This is because the discriminator is very confident and outputs values close to 0 for fakes and 1 for real images. The distance between the distributions can also be interpreted as follows: the discriminator can easily distinguish between real and fake images, while the generator is weak and cannot produce images that are close to real ones.

For the generator training step, we pass the true labels along with the generated images to compute gradients that point towards the real distribution. Since we pass true labels, the loss term becomes \log(D(x_{\text{fake}})), and if the discriminator is very confident it will output values close to zero for the fakes, so we get a big negative number.
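If it helps, here is a minimal PyTorch sketch of that step (the tensors are hypothetical placeholders, not the course code); note that the BCE implementation carries the minus sign, so the same quantity shows up as a large positive loss:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the generator step described above (names are hypothetical).
# `disc_logits` stands for the discriminator's raw outputs on a batch of fakes.
disc_logits = torch.randn(8, 1)                 # pretend discriminator outputs (logits)
real_labels = torch.ones_like(disc_logits)      # we pass *true* labels for fake images

# BCE with y = 1 reduces to -log(D(x_fake)); small D(x_fake) -> large loss value.
gen_loss = F.binary_cross_entropy_with_logits(disc_logits, real_labels)
print(gen_loss)
```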

I’m not sure about the derivative formula you posted, but if we plug in the values, y=1 for the true label and \hat{y}=0 for the prediction, we will get \frac{1}{0}.

There is also another interpretation of this problem. Optimizing the BCE loss for GANs is equivalent to minimizing the Kullback-Leibler divergence between two distributions (remember that the cross-entropy is the entropy of the distribution plus D_{\text{KL}}). If the distributions have disjoint support (i.e., do not overlap), D_{\text{KL}}(P \| Q)=+\infty. The Wasserstein GAN paper explains this problem nicely.

1 Like

I would like to thank you, Aray, for your comprehensive answer.

I think that I have to take a close look at the Wasserstein GAN paper that you mentioned.

However, looking at the figure I posted earlier, it can be inferred that our cost function is bounded from below and above. In that case, the gradient of our loss function close to those regions is approximately zero. My confusion is that the BCE loss, as you mentioned, can grow to +\infty as we plug in y=1 for poor fake images. Hence, the cost function should not be bounded from above, and consequently the gradients in the extreme regions should not be anywhere close to zero.

There might be a misunderstanding of the cost function for me; I would be glad if you could help me with it.

I’m not 100% sure about the shape of the loss function shown in the video (it may represent the loss of the discriminator on both real and fake samples), but here is how I understand the problem:

The BCE loss can be expressed as follows:

\text{BCE} = \begin{cases} -\log(\hat{y}) & \text{if } y = 1 \\ -\log(1 - \hat{y}) & \text{if } y = 0 \end{cases}

Since training the discriminator is just a binary classification task, I will concentrate on the training step of the generator. If you’re interested in the details, here’s an illustration:

As I already mentioned, we feed fake images into the discriminator and pass true labels to the BCE loss. The key point is that we pass true labels for images that are fake, so the loss in that case will be - \log(D(x_\text{fake})). Here is an illustration of this function:

[image: plot of -\log(D(x_\text{fake}))]

If the discriminator is confident and the generator is weak (big distance between distributions), D(x_\text{fake}) will output values close to 0, so the loss will be high.
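For example, just plugging a few values into - \log(D(x_\text{fake})) (a quick sketch, nothing course-specific):

```python
import torch

# Quick numeric check: -log(D(x_fake)) grows as the discriminator's output approaches 0.
d_fake = torch.tensor([0.5, 0.1, 0.01, 0.001])
print(-torch.log(d_fake))   # tensor([0.6931, 2.3026, 4.6052, 6.9078])
```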

Hope someone will correct me if I am wrong :)

1 Like

Hi @ParetoFront! Welcome to the community.

You’ve raised a very interesting point that might indeed be a little confusing in the video. Here’s my take on it.

Quoting the lecture:

At the end of this minimax game, the generator and discriminator interaction translates to a more general objective for the whole GAN architecture. That is to make the real and generated data distributions of features very similar. Trying to get the generated distribution to be as close as possible to the reals. This minimax of the Binary Cross-Entropy loss function is somewhat approximating the minimization of another complex cost function that’s trying to make this happen.

In summary, we want the real and generated distributions to be as close as possible. I believe the J that is plotted is actually not the BCE Loss, but rather this underlying cost function that measures how different the distributions are.

This cost function, as stated, is complex and might be difficult to express analytically. I try to think of it as being inversely correlated with the overlap between the distributions, meaning it is low for a perfect overlap and high for little overlap.

There are a few things we can infer about J:

  • There is a limit to how different the distributions can be — after all, the values must lie between 0 and 1. This means that this J cannot go up to infinity.
  • When the distributions are very far apart, moving them a little closer helps very little. Thus, the derivative of J should be small when the distance is too large.

I believe these are the points that were confusing for you. Let me know if they make sense now.

3 Likes

Yeah, that J plot is definitely the underlying cost function, not the BCE loss, as stated at the end of the lecture:

In summary, GANs try to make the generated distribution look similar to the real one by minimizing the underlying cost function that measures how different the distributions are. As a discriminator improves during training and sometimes improves more easily than the generator, that underlying cost function will have those flat regions when the distributions are very different from one another, where the discriminator is able to distinguish between the reals and the fakes much more easily, and be able to say, “Reals look really real, a label of one and fakes look really fake, a label of zero.” All of this will cause vanishing gradient problems.

2 Likes

Thank you again for your answer, Aray. I really appreciate the time and effort you put into answering me.

I think whenever a logarithm is involved in the loss function, this unfortunate +\infty loss can happen. However, does the Earth Mover’s loss have the same issue?

Additionally, the figure you provided raised another question for me. Do the generator and discriminator always have to be symmetrical? Since the discriminator’s job is a lot easier than the generator’s, can we shed some hidden layers off the discriminator? Especially since the deeper the discriminator is, the more chain-rule factors are involved in the generator’s backpropagation.

Thank you.

Hey Pedrorohde, thank you very much for your answer!

It makes sense to me now, and my confusion over that part is alleviated. Is this underlying cost function that you mentioned another term for the Earth Mover’s distance?

Update: the Earth Mover’s loss is linear, and its gradient is the same at every point of the function. Therefore, it should not be the Earth Mover’s loss. Can you shed some light on the underlying cost function, please?

The discriminator in the Wasserstein GAN has no activation function in the output layer. That is, it has a linear activation, so it does not output probabilities for real and fake images, and therefore the output does not have to be between 0 and 1. It instead outputs a real-valued score. That is why the authors called it a critic. The WGAN loss is very simple: the critic maximizes D(x) - D(G(z)) and the generator maximizes D(G(z)). As you can see, there is no \log, but WGAN has other problems, such as enforcing the Lipschitz constraint.
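Here is a rough sketch of those two objectives written as minimization problems (the critic scores are made-up tensors, and the Lipschitz constraint, e.g. weight clipping or a gradient penalty, is left out):

```python
import torch

# Rough sketch of the WGAN objectives (hypothetical critic scores, no sigmoid).
crit_real = torch.randn(8, 1)   # critic scores for real images, unbounded
crit_fake = torch.randn(8, 1)   # critic scores for generated images

# Critic maximizes E[D(x)] - E[D(G(z))]; written as a loss to minimize:
critic_loss = -(crit_real.mean() - crit_fake.mean())

# Generator maximizes E[D(G(z))]; written as a loss to minimize:
gen_loss = -crit_fake.mean()
```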

As for the architecture in the illustration, it is a DCGAN (a deconvolutional generator paired with a convolutional discriminator), where the generator and discriminator are symmetric, which is optional. For example, StyleGAN2 uses different architectures for the generator and discriminator.

The size of both networks is very important. Typically, the larger the network, the more capacity it has to capture important characteristics of the data. SOTA architectures generally use skip connections, as in U-Net or ResNet, to overcome the vanishing gradient problem.
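For what a skip connection looks like in practice, here is a minimal sketch of a residual block (an illustration only, not any particular architecture from the course):

```python
import torch
import torch.nn as nn

# Minimal residual (skip-connection) block: the identity path lets gradients
# flow around the convolutions, which helps against vanishing gradients.
class ResBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)   # skip connection: output = input + residual

x = torch.randn(1, 16, 32, 32)
print(ResBlock(16)(x).shape)      # torch.Size([1, 16, 32, 32])
```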

3 Likes

Thank you very much, Aray! Your answers made many concepts clearer to me.

Hi @pedrorohde,

Thank you for your explanation; you cast light on the fact that the loss function represented in the chart is NOT the BCE loss but a more complex function. That being said, does anyone have a reference that explains how this complex loss is derived? You provide a great intuition relating the two flat zones at the extremes to the overlap of the distributions, but I would like to unveil the analytical part of it.

Right now, using the BCE loss, I can recover only part of the story. When feeding samples from the generator, we want to maximize -(1-y)\log(1-h) with y=0 (generated distribution). Then, if we compute the gradient of the loss with respect to the pre-sigmoid logit (denoted by z), we get \text{gradient} = \frac{dh}{dz}\cdot\frac{1}{1-h}. In the case where the real and the generated distributions are clearly different, h \approx 0 and so \text{gradient} \approx \frac{dh}{dz} \approx 0. But since \frac{dh}{dz} = h(1-h), the gradient simplifies to h, which is close to zero only when h \approx 0; so I do not see how to derive the flat region on the side where the distributions overlap.
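For what it’s worth, a quick autograd check (a sketch with made-up logits) agrees with that derivation: the gradient of -\log(1-h) with respect to the logit z comes out equal to h = \sigma(z), so it only vanishes when h \approx 0.

```python
import torch

# Sketch: numerically check that d/dz[-log(1 - sigmoid(z))] = sigmoid(z) = h.
z = torch.tensor([-5.0, -1.0, 0.0, 2.0], requires_grad=True)  # made-up logits
h = torch.sigmoid(z)                       # discriminator output for generated samples
loss = -torch.log(1 - h).sum()             # BCE term with y = 0, summed for one backward pass
loss.backward()

print(z.grad)                              # gradient w.r.t. the logits
print(h.detach())                          # equals h, matching the analytical result
```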

Thank you!

The original GAN loss is built from BCE and consists of two BCE terms: one for real and one for fake data. Here is the definition of BCE:

\text{BCE} = y\log \hat{y}+(1-y)\log (1-\hat{y})

We can formulate it as follows:

\text{BCE} = \begin{cases} \log(\hat{y}) & \text{if } y = 1 \\ \log(1 - \hat{y}) & \text{if } y = 0 \end{cases}

In the case of a GAN, we know the source of the data, so we can split the loss into two loss functions, for real and fake images, and plug in y. We get:

\text{BCE}_D= \mathbb{E}_{x ∼ P_{data}}[\log D(x)]
\text{BCE}_G= \mathbb{E}_{z ∼ p_z}[log (1 - D(G(z)))]

By combining these equations, we get GAN loss for the discriminator:

L(D, G) = \mathbb{E}_{x ∼ P_{data}}[\log D(x)] + \mathbb{E}_{z ∼ p_z}[log (1 - D(G(z)))]

This is a combination of two BCEs. I’m still not sure whether the visualization in the video represents this function but, as I already wrote above, having \log in the equation limits our function to the range [0, 1]. If the discriminator is strong and discriminates real and fake images with high confidence (values close to 0 and 1), the gradient will be close to 0, which causes the vanishing gradient problem.
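For concreteness, here is a minimal sketch (hypothetical logits, not the course notebook) of how those two BCE terms are usually combined for the discriminator in code:

```python
import torch
import torch.nn.functional as F

# Sketch: discriminator loss as the sum of two BCE terms (hypothetical logits).
real_logits = torch.randn(8, 1)   # D's raw outputs on real images
fake_logits = torch.randn(8, 1)   # D's raw outputs on generated images

bce_real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
bce_fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))

# Minimizing this is equivalent to maximizing L(D, G) above (BCE carries the minus sign).
disc_loss = bce_real + bce_fake
```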

Cross-entropy is closely related to the Kullback-Leibler divergence, and in the original GAN paper it was shown that optimizing the GAN loss is equivalent to minimizing the Jensen–Shannon divergence, a symmetrized and smoothed version of the KL divergence.

D_{KL}(p \,\|\, q) = \sum_{i=1}^{N} p(x_i)\log\frac{p(x_i)}{q(x_i)}

It is easy to see that if there is no overlap between the distributions, there will be points where p(x_i) > 0 but q(x_i) = 0, so the ratio inside the log blows up and D_{KL} = +\infty. The Wasserstein GAN paper explains this problem nicely.
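As a tiny numeric illustration of that blow-up (a sketch with made-up discrete distributions):

```python
import numpy as np

# Sketch: D_KL(p || q) blows up when p puts mass where q has none.
p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])        # disjoint support

support = p > 0                           # terms with p(x_i) = 0 contribute nothing
with np.errstate(divide="ignore"):
    kl = np.sum(p[support] * np.log(p[support] / q[support]))
print(kl)                                 # inf, because q = 0 where p > 0
```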

Hello aray, could you please explain how the BCE range is between 0 and 1,

[screenshot of the quoted statement]

as you said before? If y = 1 and \hat{y} is close to zero, then the loss approaches +\infty. While training the generator we use -\log(\hat{y}), right?

Here a is the model output:

\frac{\partial \text{BCE}}{\partial a} = \frac{a - y}{a - a^2}

In the lecture it says (as I understand it) that if the discriminator is too confident that the generator’s output is fake, the generator stops learning, or its gradient approaches zero.

But if we put y=1 and \hat{y} = 10^{-3}, because the generator wants its fake images classified as real while the discriminator is very confident they are fake, we get \frac{\partial \text{BCE}}{\partial \hat{y}} \approx -1000. So the gradient in this step is really large. How does it lead to vanishing gradients?

I think the formula for the derivative with respect to a should be

\frac{\partial \text{BCE}}{\partial a} = \frac{a - y}{a - a^2}
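A quick autograd check (just a sketch with arbitrary values for a and y, not course code) agrees with this closed form:

```python
import torch
import torch.nn.functional as F

# Sketch: compare autograd's dBCE/da with the closed form (a - y) / (a - a^2).
a = torch.tensor([0.001, 0.3, 0.9], requires_grad=True)   # model outputs (after sigmoid)
y = torch.tensor([1.0, 0.0, 1.0])                          # labels

loss = F.binary_cross_entropy(a, y, reduction="sum")
loss.backward()

print(a.grad)                              # autograd gradients
print(((a - y) / (a - a**2)).detach())     # closed-form formula, same values
```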

I think you misunderstood the idea that I wanted to convey. The part with y and \hat{y} is only related to the binary cross-entropy. These two variables represent the probability that the image is real: y denotes the ground truth, while \hat{y} is the prediction. I just wanted to show that the final loss is a combination of two BCEs, for real and fake images, which is:

L(D, G) = \mathbb{E}_{x ∼ P_{data}}[\log D(x)] + \mathbb{E}_{z ∼ p_z}[log (1 - D(G(z)))]

where the first term is \text{BCE}_D= \mathbb{E}_{x ∼ P_{data}}[\log D(x)] and the second one is \text{BCE}_G= \mathbb{E}_{z ∼ p_z}[log (1 - D(G(z)))]

If D is very confident, it will output 1 for real and 0 for fake images. If we plug it into the loss, we get:

L(D, G) = \mathbb{E}_{x ∼ P_{data}}[\log (1)] + \mathbb{E}_{z ∼ p_z}[log (1 - 0)]

Since log(1)=0, we will get 0 gradients.
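To make that concrete, here is a quick sketch (made-up logits, not the course code) of the gradients of L(D, G) with respect to the discriminator’s raw outputs when D is very confident:

```python
import torch

# Sketch: gradients of L(D, G) = log(D(x)) + log(1 - D(G(z))) for a very confident D.
real_logit = torch.tensor(8.0, requires_grad=True)    # D(x) = sigmoid(8) ~ 0.9997
fake_logit = torch.tensor(-8.0, requires_grad=True)   # D(G(z)) = sigmoid(-8) ~ 0.0003

value = torch.log(torch.sigmoid(real_logit)) + torch.log(1 - torch.sigmoid(fake_logit))
value.backward()

print(real_logit.grad, fake_logit.grad)   # ~3e-4 and ~-3e-4: the gradients have (almost) vanished
```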

[screenshot of the quoted equations]
But isn’t \text{BCE}_D just the same BCE equation, since we train the discriminator on both generated and real data, while for \text{BCE}_G we use label = 1?
Sorry, I am not able to understand this relation. The graph on the slide is most probably not of the BCE loss but of some other function.

Sorry, I didn’t get your question. Once again: since we know the source of the data and hence y, we plug it into the BCE equation for both real and fake data. We then combine these two BCEs into the final GAN loss, which is:

L(D, G) = \text{BCE}_D + \text{BCE}_G = \mathbb{E}_{x ∼ P_{data}}[\log D(x)] + \mathbb{E}_{z ∼ p_z}[log (1 - D(G(z)))]
1 Like

The comment about vanishing gradient when D becomes very good really did not make sense to me.

@aray what you have written is the “value” function, not the loss function. To make it into a loss function we’d have to stick a minus sign in front of it. As it stands, D is trying to maximize it and G is trying to minimize it by trying to get D(G(z)) \approx 1, thereby driving the second term towards -\infty. However, this is not how we implement it in the Week 1 exercise. We have two different optimizers, and the one for G is simply trying to maximize \log(D(G(z))), again by pushing D(G(z)) \approx 1.

Going back to the original question of this thread: if D becomes really good at identifying fakes, it will always output D(G(z)) \approx 0, which means that the gradient of \log(D(G(z))) will approach 1! The gradient will not vanish in this case.

Upon further reflection, the gradient of the original value function

V = \mathbb E \left[log(D(x)) \right] + \mathbb E \left[ log(1-D(G(z))) \right]

does go to 0 as D becomes very good. Both the first and the second term will go to 0, which means that even though G only sees the second part, it will still be affected by this. Our implementation with two optimizers in the Week 1 exercise was probably meant to counter this effect, which is why it trained pretty nicely.
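To make the difference concrete, here is a small autograd sketch (made-up logit values, not the exercise code) comparing the saturating term \log(1 - D(G(z))) from the value function with the -\log(D(G(z))) form the exercise actually optimizes:

```python
import torch

# Sketch: gradients w.r.t. the fake logit when D confidently rejects the fake.
z_logit = torch.tensor(-8.0, requires_grad=True)   # D(G(z)) = sigmoid(-8) ~ 0.0003
d_fake = torch.sigmoid(z_logit)

# Saturating form from the value function: G minimizes log(1 - D(G(z))).
loss_saturating = torch.log(1 - d_fake)
grad_sat = torch.autograd.grad(loss_saturating, z_logit, retain_graph=True)[0]

# Non-saturating form used in practice: G maximizes log(D(G(z))), i.e. minimizes -log(D(G(z))).
loss_non_saturating = -torch.log(d_fake)
grad_non_sat = torch.autograd.grad(loss_non_saturating, z_logit)[0]

print(grad_sat)       # ~ -3e-4: vanishes as D becomes confident
print(grad_non_sat)   # ~ -1.0: stays useful for training G
```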