Questions about Encoding Model & Sampling Layer

I’m having trouble understanding our implementation of the output of the Encoding model and the behavior of the Sampling layer.

  1. My belief is that when given a batch of images, the encoding model will output batches of 2 vectors - mu and sigma. These vectors represent the mean and variance of the normal distribution characterizing the distribution of the images when they are mapped to that encoding space. If this is so, shouldn’t I expect every image to result in the same mu and sigma, since they should all come from the same distribution?

  2. The sampling layer generates a batch of vectors from a standard normal distribution in the encoding space. It then uses the mu and sigma vectors to map these vectors to the distribution of the encoded images. The resultant vector is then passed to the decoder. If this is so, the decoder is fed vectors whose only relation to the original images is having been drawn from the same normal distribution in the encoding space. In this case, why would the decoder’s output have anything to do with the original input image?

  3. The variable we call sigma is really log(variance), which would generally be called log(sigma^2). I believe this is so, based upon the fact that z is constructed using epsilon * exp(0.5 * sigma) and not just epsilon * sigma. I also believe this is so, based on the construction of our KL-loss term. Why was this decision made? Is it some sort of best practice?
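To spell out the algebra behind that belief: if the variable stores log(sigma^2) rather than sigma itself, then exp(0.5 * sigma) is exactly what recovers the standard deviation:

```latex
z = \mu + \epsilon \cdot e^{\frac{1}{2}\log\sigma^{2}}
  = \mu + \epsilon \cdot \sqrt{\sigma^{2}}
  = \mu + \epsilon\,\sigma
```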


Hi @Steven1,
Regarding your queries in point 1: the mean vector (μ) and the variance vector (σ^2) usually vary per image, even though the images come from the same distribution.
The reason behind this is that real-world data always contains natural variation.
In a VAE the main challenge is to make the model generate new data points from a consistent latent-space distribution. To achieve this, the VAE introduces the reparameterization trick. Here we get the latent code (z = μ + ε * σ; the sampled ε is scaled by σ and shifted by μ) that represents the encoded information about the input data, and because ε is random, you get a slightly different latent code for the same input image.
Also, regularization makes the encoder spread the latent codes for different images to ensure that the latent space is well behaved.
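As a minimal NumPy illustration of that equation (the shapes, names, and random values here are made up for demonstration; this is not the course code):

```python
import numpy as np

batch_size, latent_dim = 4, 2

# per-image outputs of the encoder (illustrative values, one row per image)
mu = np.random.randn(batch_size, latent_dim)             # means differ per image
sigma = np.abs(np.random.randn(batch_size, latent_dim))  # std devs differ per image

# reparameterization trick: sample eps ~ N(0, 1), then scale by sigma and shift by mu
eps = np.random.randn(batch_size, latent_dim)
z = mu + eps * sigma  # a slightly different z every call, even for the same image

print(z.shape)  # (4, 2): one latent code per image in the batch
```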

Regards
Arif

Hello Steven,

The answer to your first question is yes: you can use either mu or sigma, and you get the same result.

To your second question:
The encoder-decoder model is a way of organizing recurrent neural networks for sequence-to-sequence prediction problems.
The approach involves two recurrent neural networks, one to encode the source sequence, called the encoder, and a second to decode the encoded source sequence into the target sequence, called the decoder.

The model is defined by three arguments: n_inputs, the cardinality of the input sequence, e.g. the number of features, words, or characters for each time step; n_output, the cardinality of the output sequence, defined in the same way; and n_units, the number of units to create in the encoder and decoder models.

The function then creates and returns 3 models, as follows:

train: Model that can be trained given source, target, and shifted target sequences.
inference_encoder: Encoder model used when making a prediction for a new source sequence.
inference_decoder: Decoder model used when making a prediction for a new source sequence.

The model is trained given source and target sequences where the model takes both the source and a shifted version of the target sequence as input and predicts the whole target sequence.

During prediction, the inference_encoder model is used to encode the input sequence once which returns states that are used to initialize the inference_decoder model. From that point, the inference_decoder model is used to generate predictions step by step.
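For concreteness, here is a sketch of what such a function could look like in Keras (the name define_models and the use of LSTM layers are my assumptions; the thread only fixes the three arguments n_inputs, n_output, n_units and the three returned models):

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense

def define_models(n_inputs, n_output, n_units):
    # training encoder: reads the source sequence and keeps only its final states
    encoder_inputs = Input(shape=(None, n_inputs))
    encoder = LSTM(n_units, return_state=True)
    _, state_h, state_c = encoder(encoder_inputs)
    encoder_states = [state_h, state_c]

    # training decoder: reads the shifted target sequence, seeded with the encoder states
    decoder_inputs = Input(shape=(None, n_output))
    decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
    decoder_dense = Dense(n_output, activation='softmax')
    decoder_outputs = decoder_dense(decoder_outputs)
    train = Model([encoder_inputs, decoder_inputs], decoder_outputs)

    # inference encoder: maps a new source sequence to its state vectors
    inference_encoder = Model(encoder_inputs, encoder_states)

    # inference decoder: predicts one step given the previous output and states
    decoder_state_input_h = Input(shape=(n_units,))
    decoder_state_input_c = Input(shape=(n_units,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    decoder_outputs, state_h, state_c = decoder_lstm(
        decoder_inputs, initial_state=decoder_states_inputs)
    decoder_outputs = decoder_dense(decoder_outputs)
    inference_decoder = Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs] + [state_h, state_c])

    return train, inference_encoder, inference_decoder
```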

A step-by-step prediction function then uses the following arguments (see the sketch after this list):

infenc: Encoder model used when making a prediction for a new source sequence.
infdec: Decoder model used when making a prediction for a new source sequence.
source: Encoded source sequence.
n_steps: Number of time steps in the target sequence.
cardinality: The cardinality of the output sequence, e.g. the number of features, words, or characters for each time step.

The function then returns a list containing the target sequence.
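A possible sketch of that prediction loop, assuming a function named predict_sequence and an all-zeros one-hot vector as the start-of-sequence input (both are my assumptions; the arguments follow the list above and the models come from the define_models sketch):

```python
from numpy import array

def predict_sequence(infenc, infdec, source, n_steps, cardinality):
    # encode the source sequence once; the returned states initialize the decoder
    state = infenc.predict(source)
    # start-of-sequence input: an all-zeros vector of the output cardinality
    target_seq = array([0.0 for _ in range(cardinality)]).reshape(1, 1, cardinality)
    output = []
    for _ in range(n_steps):
        # predict the next time step together with the updated states
        yhat, h, c = infdec.predict([target_seq] + state)
        output.append(yhat[0, 0, :])
        # feed the prediction and states back in for the next step
        state = [h, c]
        target_seq = yhat
    return array(output)
```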

  3. Regarding the variable we call sigma:

In the variational autoencoder, the bottleneck vector is replaced by two separate vectors: the mean of the distribution and the standard deviation of the distribution. So whenever data is fed into the decoder, samples from the distribution are passed through the decoder. The loss function of the variational autoencoder consists of two terms. The first one is the reconstruction loss; it is the same as in the autoencoder, except we have an expectation term because we are sampling from the distribution.

The second term is the KL divergence term. It ensures that the latent codes stay close to a normal distribution: we train the latent space to stay near a mean of 0 and a standard deviation of 1, i.e. a standard normal distribution.
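A sketch of both loss terms in TensorFlow, assuming the encoder outputs mu and log_var as in the original question and using a squared-error reconstruction term as one common choice (the names are mine, not the course code):

```python
import tensorflow as tf

def vae_loss(x, x_reconstructed, mu, log_var):
    # reconstruction term: how well the decoder reproduces the input
    reconstruction_loss = tf.reduce_mean(
        tf.reduce_sum(tf.square(x - x_reconstructed), axis=-1))
    # KL divergence between N(mu, sigma^2) and the standard normal N(0, 1);
    # note the formula uses log_var directly, which is why the variable named
    # "sigma" in the question is really log(sigma^2)
    kl_loss = -0.5 * tf.reduce_mean(
        tf.reduce_sum(1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1))
    return reconstruction_loss + kl_loss
```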

The mean and standard-deviation vectors are combined into a sampled vector, and these samples are fed to the decoder. The problem is that we cannot backpropagate, i.e. we cannot push gradients, through the sampled vector. In order to run the gradients through the entire network and train it, we use the reparameterization trick.

If you look at the latent vector, it can be seen as the sum of mu, which is a parameter we are learning, and sigma, also a parameter we are learning, multiplied by epsilon; this epsilon is where we put the stochastic part. Epsilon is always Gaussian with zero mean and a standard deviation of 1. So the process is: we sample epsilon, multiply it by sigma, and add it to mu to get the latent vector. Mu and sigma are the only things we have to train, so it is possible to push the gradients through them to decrease the error and train the network. It is fine that epsilon is not trained: we need the stochasticity, which helps us in generating new images.
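Putting this together as a custom Keras layer, here is a minimal sketch assuming the convention from the original question, where the second encoder output holds log(sigma^2), so the standard deviation is exp(0.5 * log_var) (names are mine, not necessarily the course code):

```python
import tensorflow as tf
from tensorflow.keras import layers

class Sampling(layers.Layer):
    """Reparameterization trick: z = mu + exp(0.5 * log_var) * epsilon."""

    def call(self, inputs):
        mu, log_var = inputs
        # epsilon is drawn fresh on every call, so the same image gives a
        # slightly different z each time, but gradients flow through mu/log_var
        epsilon = tf.random.normal(shape=tf.shape(mu))
        return mu + tf.exp(0.5 * log_var) * epsilon
```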


It is a long read, hope it clears your doubt!!!

Happy learning!!!

Regards
DP