I don’t understand how a NN can be trained to predict noise, because noise is random and isn’t supposed to exhibit any learnable features. Can anyone help explain? Thanks.

Hi zyzhang1130,

Here’s my two cents.

The NN is calibrated starting from pictures to which noise is added in successive steps. In other words, the parameters of the model are calibrated on pairs consisting of a picture and the same picture with a little noise added. This is then iterated: the picture with a bit of noise becomes the starting point, the picture with some more noise becomes its counterpart, and the parameters are calibrated again. When constructing a picture from noise, the parameters are fixed and the process is run in reverse. This has a particular direction: from the random noise, each step moves toward the kind of picture the parameters were calibrated on. To arrive at a different picture, a bit of noise is added.

So in the end the noise itself is random, but the step from random noise to slightly less random noise is guided by the pictures the system was calibrated on.
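
To make the noising steps concrete, here is a minimal sketch in plain Python. It is only an illustration under my own assumptions: the four-value "picture", the constant schedule `betas`, and the helper name `add_noise_step` are all hypothetical, and the scaling by `sqrt(1 - beta)` follows the usual Gaussian noising step as I understand it. The point is that each step produces a pair (noisier picture, noise that was added), and such pairs are what a network could be trained on.

```python
import math
import random

def add_noise_step(image, beta, rng):
    """One noising step: shrink the image slightly, then add Gaussian noise of variance beta."""
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    noisy = [math.sqrt(1.0 - beta) * x + math.sqrt(beta) * e
             for x, e in zip(image, noise)]
    return noisy, noise

rng = random.Random(0)
image = [1.0, -1.0, 0.5, 0.0]   # toy "picture" with four pixel values (hypothetical)
betas = [0.1] * 50              # hypothetical constant noise schedule

pairs = []                      # (noisier picture, noise added at that step)
x = image
for beta in betas:
    x, noise = add_noise_step(x, beta, rng)
    pairs.append((x, noise))

# After many steps, x carries very little of the original picture:
# the signal is scaled by sqrt(0.9 ** 50), roughly 0.07.
```

The noise values on their own are random, but each one is stored together with the picture it was mixed into, so the target is tied to the input the network sees.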

After watching some other videos on diffusion models, I wonder whether it would be better to phrase it as predicting the image instead?

Hi zyzhang1130,

That makes sense to me. In being calibrated on how to get from an image with a certain amount of noise to an image with a bit more noise, the system is predicting the image with a bit more noise.

Although, the way it is explained here, the model predicts the noise first and then subtracts it to get the image. I wonder why this extra step is needed if we can directly predict a less noisy version of the image (I’m referring to the denoising process).

Well, you can take the difference between the image with more noise and the image with a bit less noise, which gives (up to scaling) the noise \epsilon that was added at that step. Repeated over all steps, this ends in a noisy image that follows a Gaussian distribution. The network is trained to predict this \epsilon. In the denoising process, the predicted \epsilon can then be used to recover images from Gaussian-distributed noise by subtracting it step by step. This is how I understand the presentation in the original paper (e.g. p. 4, p. 8).
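
As a rough illustration of why the two phrasings are so close, here is a small sketch assuming the usual closed form x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * \epsilon. The function names and the value of `alpha_bar` are my own, chosen for illustration. Given the noisy image, knowing \epsilon gives you the image and knowing the image gives you \epsilon, so predicting one amounts to predicting the other up to a rescaling.

```python
import math
import random

def noise_image(x0, alpha_bar, eps):
    """Closed-form noising: mix the clean image x0 with Gaussian noise eps."""
    return [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
            for x, e in zip(x0, eps)]

def image_from_noise(x_t, alpha_bar, eps_pred):
    """Recover the image estimate by subtracting the (predicted) noise."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(x_t, eps_pred)]

def noise_from_image(x_t, alpha_bar, x0_pred):
    """The other direction: recover the noise from a (predicted) clean image."""
    return [(x - math.sqrt(alpha_bar) * x0) / math.sqrt(1.0 - alpha_bar)
            for x, x0 in zip(x_t, x0_pred)]

rng = random.Random(0)
x0 = [1.0, -1.0, 0.5, 0.0]          # toy clean image (hypothetical)
alpha_bar = 0.3                     # hypothetical noise level at some step t
eps = [rng.gauss(0.0, 1.0) for _ in x0]

x_t = noise_image(x0, alpha_bar, eps)
x0_back = image_from_noise(x_t, alpha_bar, eps)    # uses the true noise as a perfect "prediction"
eps_back = noise_from_image(x_t, alpha_bar, x0)    # uses the true image as a perfect "prediction"
```

With a perfect prediction, `x0_back` matches `x0` and `eps_back` matches `eps` up to floating-point error. A real network's prediction is imperfect, which is where the choice of target can make a practical difference in training.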