The original GAN loss is built on binary cross-entropy (BCE) and consists of two BCE terms: one for real and one for fake data. Here is the definition of BCE (written without the usual leading minus sign, since the discriminator maximizes these terms rather than minimizes them):
\text{BCE} = y\log \hat{y} + (1-y)\log (1-\hat{y})
We can formulate it as follows:
\text{BCE} = \begin{cases} \log \hat{y} & \text{if } y = 1 \\ \log(1 - \hat{y}) & \text{if } y = 0 \end{cases}
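To make the case split concrete, here is a tiny Python sketch (my own illustration, not from any particular library) of the per-sample term, using the same un-negated convention as above:

```python
import math

def bce_term(y, y_hat):
    # Per-sample term y*log(y_hat) + (1 - y)*log(1 - y_hat),
    # written without the usual leading minus sign.
    return y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)

# Only one of the two logs survives, depending on the label:
print(bce_term(1, 0.9), math.log(0.9))      # y = 1  ->  log(y_hat)
print(bce_term(0, 0.9), math.log(1 - 0.9))  # y = 0  ->  log(1 - y_hat)
```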
In the case of a GAN, we know the source of each sample, so we can split the loss into two terms, one for real and one for fake images, and plug in the corresponding value of y. We get:
\text{BCE}_D = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)]
\text{BCE}_G = \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]
By combining these equations, we get the GAN loss for the discriminator:
L(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]
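As a sanity check, here is a minimal PyTorch sketch of how this objective could be estimated on a batch; the function and argument names (discriminator_loss, d_real, d_fake) are mine, chosen for illustration, and assume the discriminator already outputs probabilities in (0, 1):

```python
import torch

def discriminator_loss(d_real, d_fake):
    """Batch estimate of L(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real images, values in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated images, values in (0, 1)
    """
    eps = 1e-7  # keep the arguments of log away from 0
    real_term = torch.log(d_real.clamp(eps, 1.0)).mean()
    fake_term = torch.log((1.0 - d_fake).clamp(eps, 1.0)).mean()
    # The discriminator maximizes this value; in practice one minimizes its negative.
    return real_term + fake_term
```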
This is just a combination of the two BCE terms. I’m still not sure if the visualization in the video represents this function, but, as I already wrote above, the discriminator output D(·) is bounded to the range [0, 1], and the log terms saturate at the ends of that range. If the discriminator is strong and separates real and fake images with high confidence (outputs close to 1 and 0), the gradients of those log terms are close to 0, which causes the vanishing gradient problem.
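Here is a small self-contained check of that saturation effect (again my own sketch): `logit` stands in for the discriminator’s pre-sigmoid score of a fake sample, and the gradient of log(1 - D(G(z))) with respect to it shrinks toward 0 as the discriminator becomes more confident.

```python
import torch

for logit_value in [0.0, -2.0, -6.0, -12.0]:
    # `logit` is the discriminator's pre-sigmoid score of a fake sample.
    logit = torch.tensor(logit_value, requires_grad=True)
    d_fake = torch.sigmoid(logit)   # D(G(z)) in (0, 1)
    loss = torch.log(1 - d_fake)    # the term the generator trains against
    loss.backward()
    # The gradient equals -sigmoid(logit), so it vanishes as the
    # discriminator grows confident that the sample is fake.
    print(f"D(G(z)) = {d_fake.item():.2e}, grad = {logit.grad.item():.2e}")
```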
Cross-entropy is closely related to the Kullback–Leibler (KL) divergence, and the original paper showed that, for an optimal discriminator, optimizing the GAN loss is equivalent to minimizing the divergence between the data and generator distributions (to be more precise, the Jensen–Shannon divergence, a symmetrized version of KL).
D_{KL}(p \,\|\, q) = \sum_{i=1}^{N} p(x_i) \log \frac{p(x_i)}{q(x_i)}
It is easy to see that if there is no overlap between the distributions, at every point either the numerator or the denominator is 0: where p(x_i) = 0 the term contributes nothing, and where q(x_i) = 0 the ratio blows up, so the divergence becomes infinite and gives no useful training signal. The Wasserstein GAN paper explains this problem nicely.
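A quick NumPy sketch (mine, on two toy discrete distributions) shows that failure mode:

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(p || q) = sum_i p(x_i) * log(p(x_i) / q(x_i)), summed only where p > 0.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Overlapping supports: a finite, well-behaved value.
print(kl_divergence([0.5, 0.5, 0.0], [0.4, 0.5, 0.1]))

# Disjoint supports: q is 0 wherever p is positive, the ratio blows up,
# and the divergence is infinite (NumPy also warns about division by zero).
print(kl_divergence([0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]))
```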