Hi @IvanK
As you have already understood, the latent code c is the part of the noise that is meant to represent meaningful features. c is also sampled randomly, alongside the incompressible noise z. The z gives the model a free hand to learn its own representation, while c constrains part of that representation to encode semantic features. The two are concatenated and passed to the generator, which is why the generator's output is written as G(z, c).
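As a minimal sketch of that input side, assuming a PyTorch generator whose first layer expects a vector of size z_dim + c_dim (the sizes and names below are illustrative, not taken from the notebook):

```python
import torch

batch_size, z_dim, c_dim = 32, 64, 10   # illustrative sizes, not the notebook's values

z = torch.randn(batch_size, z_dim)       # incompressible noise
c = torch.randn(batch_size, c_dim)       # latent code (continuous, Gaussian here)
gen_input = torch.cat([z, c], dim=1)     # concatenated input to the generator
# fake = generator(gen_input)            # generator's output G(z, c)
```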
The overall goal is to maximize the mutual information between c and G(z, c) so that, unlike the noise z in a standard GAN, the information carried by c is not lost while the generator produces fake samples. In other words, maximizing the mutual information means that c can be recovered from the fake samples, i.e. from the output G(z, c), so no information is lost. Equivalently, observing G(z, c) reduces the entropy (randomness) of the latent code, and that reduction in uncertainty is exactly the mutual information being maximized.
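For reference, this is the regularized minimax objective from the InfoGAN paper, where $\lambda$ weights the mutual-information term (in practice a variational lower bound is optimized in place of $I$):

$$\min_G \max_D \; V_I(D, G) = V(D, G) - \lambda\, I\big(c;\, G(z, c)\big)$$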
Intuitively, the latent code c ends up capturing the pertinent features (which can later be used for controllable generation) because the generator is regularized by this mutual-information term. The re-extraction of c is done by approximating the posterior distribution of c given the generated image (a Gaussian in the case of the notebook) with an auxiliary network, which provides the variational lower bound on I(c; G(z, c)) explained in the paper. The auxiliary network in the notebook predicts the mean and variance of that Normal distribution, from which c can be reconstructed.
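A minimal sketch of how such an auxiliary head and the corresponding loss could look, assuming the head takes shared discriminator features and outputs a mean and a log-variance per latent dimension (QHead and gaussian_nll are illustrative names, not the notebook's):

```python
import torch
import torch.nn as nn

class QHead(nn.Module):
    """Auxiliary head: predicts mean and log-variance of the latent code c."""
    def __init__(self, feat_dim, c_dim):
        super().__init__()
        self.mu = nn.Linear(feat_dim, c_dim)
        self.logvar = nn.Linear(feat_dim, c_dim)

    def forward(self, features):
        return self.mu(features), self.logvar(features)

def gaussian_nll(c, mu, logvar):
    """Negative log-likelihood of c under N(mu, exp(logvar)), constant term dropped.
    Minimizing it tightens the variational lower bound on I(c; G(z, c))."""
    return 0.5 * (logvar + (c - mu) ** 2 / logvar.exp()).sum(dim=1).mean()
```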
Now coming to the training steps given in the notebook (a rough sketch of the full loop in code follows the list):
Step 1. The generator takes the concatenated (z, c) as input and produces the fake output G(z, c).
Step 2. In this step, the discriminator takes G(z, c) as input. The discriminator has two heads: one predicts, as usual, whether the image is real or fake, and the other predicts the mean and variance of the latent code distribution.
Step 3. This step is similar to the previous one, except that this time the discriminator takes real images as input and computes the same outputs.
Step 4. The adversarial loss and the mutual-information loss are computed, the discriminator's backward pass is performed, and hence D's parameters are updated.
Step 5. Now it's time for the generator to learn from the discriminator's adversarial loss as well as from maximizing the mutual information between c and G(z, c). Hence, the fake samples are passed to the discriminator again; the adversarial loss together with the mutual-information term is calculated, and backpropagation is performed to update the generator's parameters.
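Putting the five steps together, a rough sketch of one training iteration might look like the following (the generator, the two-headed discriminator, the optimizers, lambda_info, and the gaussian_nll helper from the earlier sketch are assumed; this mirrors the structure of the steps rather than the exact notebook code):

```python
import torch
import torch.nn.functional as F

def train_step(generator, discriminator, d_opt, g_opt,
               real_images, z_dim, c_dim, lambda_info=0.1):
    batch_size = real_images.size(0)

    # Step 1: sample (z, c), concatenate, and generate the fake batch G(z, c)
    z = torch.randn(batch_size, z_dim)
    c = torch.randn(batch_size, c_dim)
    fake = generator(torch.cat([z, c], dim=1))

    # Step 2: discriminator on fakes -> real/fake logit plus (mu, logvar) of c
    fake_logit, mu_f, logvar_f = discriminator(fake.detach())
    # Step 3: discriminator on real images -> same outputs
    real_logit, _, _ = discriminator(real_images)

    # Step 4: adversarial loss + mutual-information loss, then update D
    d_adv = (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
             + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))
    d_info = gaussian_nll(c, mu_f, logvar_f)
    d_opt.zero_grad()
    (d_adv + lambda_info * d_info).backward()
    d_opt.step()

    # Step 5: pass the fakes through D again; update G with the adversarial
    # loss plus the mutual-information term
    fake_logit, mu_f, logvar_f = discriminator(fake)
    g_adv = F.binary_cross_entropy_with_logits(fake_logit, torch.ones_like(fake_logit))
    g_info = gaussian_nll(c, mu_f, logvar_f)
    g_opt.zero_grad()
    (g_adv + lambda_info * g_info).backward()
    g_opt.step()
```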