Why does training predict the noise rather than the clean image directly, while generation goes step by step?

When training the diffusion neural network (a U-Net), it takes each clean image, samples a random time-step, adds the time-step-dependent noise, and trains the model to predict that noise.
Therefore, after it is trained, whenever we feed the model a noisy image together with its time-step, the model should be able to output the clean image in one step, because that is what it was trained to do: predict the noise and then remove it.
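
To make my understanding of the training step concrete, here is a minimal PyTorch sketch, assuming a standard DDPM linear beta schedule and a noise-prediction model `model(x_t, t)`; the names `model`, `T`, `betas`, and `alphas_cumprod` are illustrative, not from any particular codebase:

```python
import torch
import torch.nn.functional as F

T = 1000                                            # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative product alpha_bar_t

def training_loss(model, x0):
    """One training step: add noise for a random time-step t, predict that noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)       # random time-step per image
    noise = torch.randn_like(x0)                          # epsilon ~ N(0, I)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # forward process q(x_t | x_0)
    return F.mse_loss(model(x_t, t), noise)               # learn to predict the added noise
```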

But when the model is used to generate new images, it has to go through 500 or so time-steps, the so-called reverse diffusion process, removing a bit of noise at each step to reach a slightly cleaner state. Why is that? Why does it not go directly to the clean state?

Could anyone help explain?

Training here is to predict the noise added at a certain step t.

So for generation, to get the image at step 0, you need to revert from a sample of the random Gaussian distribution: you incrementally call the network to predict the noise at a certain step t and remove that predicted noise. Here you need to be careful not to simply subtract the noise, but to strictly follow the mathematical equations that mimic the reverse of the noise-adding process.
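
As a rough sketch of what "follow the mathematical equations" means, here is the standard DDPM sampling loop in PyTorch (Algorithm 2 of the DDPM paper), assuming the same `betas` schedule and noise-prediction `model` as above; this is an illustration, not the only possible sampler:

```python
import torch

@torch.no_grad()
def sample(model, shape, betas):
    """Reverse diffusion: start from pure Gaussian noise and denoise step by step."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    T = betas.shape[0]

    x = torch.randn(shape)                                    # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = model(x, torch.full((shape[0],), t))            # predicted noise at step t
        # Not just "x - eps": the DDPM posterior mean is
        #   mu_t = 1/sqrt(alpha_t) * (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps)
        mean = (x - betas[t] / (1 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # add fresh noise, sigma_t^2 = beta_t
        else:
            x = mean                                          # final step is deterministic
    return x
```

Note that each step only moves from x_t to x_{t-1}, and fresh noise is injected at every step except the last; that is why the network has to be called once per time-step instead of jumping straight to the clean image.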