Why does training predict the noise rather than the clean image directly, while generation runs step by step?

Training here teaches the network to predict the noise that was added to a clean image to produce its noisy version at a given step t.
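As a rough illustration of what one training example looks like, here is a minimal NumPy sketch. It assumes the common DDPM-style setup with a linear noise schedule; the names `betas`, `alpha_bars`, `make_training_pair`, and `noise_prediction_loss` are made up for this example, and the network itself is left out (only its input and target are built).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed noise schedule (linear betas, as in many DDPM write-ups).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # cumulative product: alpha_bar_t


def make_training_pair(x0, t, rng):
    """Noise a clean image x0 to step t; return (x_t, eps).

    x_t is what the network would see, eps is what it should predict.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps


def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))


# Toy "images": a batch of 4 arrays of size 8x8 (stand-in data).
x0 = rng.standard_normal((4, 8, 8))
t = 500
x_t, eps = make_training_pair(x0, t, rng)
```

In a real training loop, `eps_pred` would come from the network called on `(x_t, t)`, and `noise_prediction_loss(eps_pred, eps)` would be minimized over many random images and steps.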

So for generation, to get the image (step 0), you have to work backwards from a sample of random Gaussian noise: you call the network repeatedly, each time predicting the noise at the current step t and removing it. Be careful here not to simply subtract the predicted noise. You have to follow the mathematical equation for the reverse process exactly, so that each step mirrors the way the noise was added in the forward process.
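The step-by-step loop can be sketched as follows. This is only an illustration under assumptions: it uses the DDPM-style update x_{t-1} = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_t) + sigma_t * z with sigma_t = sqrt(beta_t), which is one common choice, and it replaces the trained network with a hypothetical `oracle_predict_noise` that knows a single target image `x0_true`. The scaling factors in `reverse_step` are what distinguish the update from a plain subtraction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed noise schedule (linear betas); must match the one used in training.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Stand-in for a trained network: an oracle that knows the one clean image.
# A real model would be a neural network taking (x_t, t).
x0_true = rng.standard_normal((8, 8))


def oracle_predict_noise(x_t, t):
    """Return the exact noise in x_t, given that the data is only x0_true."""
    return (x_t - np.sqrt(alpha_bars[t]) * x0_true) / np.sqrt(1.0 - alpha_bars[t])


def reverse_step(x_t, t, eps_pred, rng):
    """One reverse step from x_t to x_{t-1} (not a plain subtraction)."""
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no fresh noise on the final step
    z = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(betas[t]) * z


# Start from pure Gaussian noise and walk back to step 0.
x = rng.standard_normal(x0_true.shape)
for t in range(T - 1, -1, -1):
    eps_pred = oracle_predict_noise(x, t)
    x = reverse_step(x, t, eps_pred, rng)

sample = x
```

With this oracle the loop lands back on `x0_true`, which is a handy sanity check for the update rule. A trained network would instead produce a new image from the data distribution.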