Why is the model trained to predict the noise rather than the clean image directly, while generation is done step by step?

When training the diffusion neural network (a U-Net), each clean image is paired with a randomly sampled time-step, noise whose strength depends on that time-step is added to it, and the model is trained to predict that noise.
So, once it is trained, whenever the model is fed a noisy image together with its time-step, it should be able to recover the clean image in one step, because that is how it was trained: predict the noise, then remove it.
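
To make the setup concrete, here is a minimal sketch of what I mean by "predict the noise, then remove it". It assumes a DDPM-style linear noise schedule; the names `add_noise` and `estimate_x0`, the schedule values, and the 8x8 stand-in image are my own illustrative choices, not taken from any particular library. The true noise is used in place of a model prediction, so it only shows the algebra of the one-step estimate, not what a real trained network achieves.

```python
import numpy as np

T = 500                                  # assumed number of time-steps
betas = np.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)     # fraction of signal left at each step

def add_noise(x0, t, eps):
    """Forward process: mix the clean image with noise for time-step t."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def estimate_x0(x_t, t, eps_pred):
    """Invert the mix above, given a prediction of the noise."""
    return (x_t - np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alpha_bars[t])

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))   # stand-in for a clean image
t = int(rng.integers(0, T))                # random time-step, as in training
eps = rng.standard_normal(x0.shape)        # the noise the model is trained to predict
x_t = add_noise(x0, t, eps)                # the noisy input the model would see

# If the noise prediction were perfect, one step would recover the clean image.
x0_hat = estimate_x0(x_t, t, eps)
```

In this sketch the one-step estimate is exact only because the true `eps` is passed in; with a learned prediction, `x0_hat` is only as good as the predicted noise.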

But when the model is used to generate new images, it has to go through 500 or so time-steps, the so-called reverse diffusion process, removing a little noise at each step to reach a slightly cleaner state. Why is that? Why can't it go directly to the clean state?
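
For reference, the step-by-step loop I am asking about looks roughly like this. It is a minimal sketch assuming the standard DDPM ancestral sampling update with a linear schedule. `predict_noise` is a hypothetical stand-in for the trained U-Net: here it is an oracle for a toy "dataset" containing a single known image, chosen only so the loop runs end to end.

```python
import numpy as np

T = 500                                  # assumed number of time-steps
betas = np.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

rng = np.random.default_rng(0)
target = rng.uniform(-1.0, 1.0, size=(8, 8))  # toy "dataset" of one image

def predict_noise(x_t, t):
    """Hypothetical stand-in for the trained U-Net (oracle for one known image)."""
    return (x_t - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

x = rng.standard_normal(target.shape)    # start from pure noise
for t in range(T - 1, -1, -1):
    eps_pred = predict_noise(x, t)
    # Remove only this step's share of the predicted noise.
    mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alphas[t])
    # Re-inject a little fresh noise on every step except the last.
    z = rng.standard_normal(x.shape) if t > 0 else np.zeros_like(x)
    x = mean + np.sqrt(betas[t]) * z
```

With this toy oracle the loop ends at `target`; a real model would replace `predict_noise` with a forward pass of the network on `(x, t)`.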

Could anyone help explain?
