Why iterative denoising instead of one-shot generation
Generating a photorealistic image directly from a random vector is a brutally hard learning problem — that's what GANs try to do, and it's why they're famously unstable to train (mode collapse, vanishing discriminators, fragile balance between two networks). Diffusion sidesteps the whole mess by breaking the impossible jump into a thousand tiny ones. Each step asks the model a question it can actually answer: "you're given an image that's slightly noisier than the truth — clean it up just a little". The training objective is plain MSE on noise; no adversarial game, no two-player equilibrium, no mode collapse.
Why this works at all
You only ever train on the inverse of a process you completely control. The forward process — adding Gaussian noise — has no learnable parameters; you can compute xt from a clean x0 in one shot using a closed-form formula. The model only has to learn one thing: "given this noisy image and this noise level, what noise was added?" That single learnable mapping is enough to generate brand-new samples from pure noise at inference time, because the chain of small denoising steps composes into a path from the noise distribution back to the data distribution.
Forward vs reverse process
The forward process corrupts data: at each step it mixes a fraction of fresh Gaussian noise into the previous image. After enough steps the original signal is gone and you have pure noise. This direction is fixed math — no neural net involved.
The reverse process is what the network learns: at each step it takes a noisy image, predicts the noise component, and subtracts a small amount of it to produce a slightly cleaner image. Run that loop many times starting from pure noise and you end up at a new, plausible sample from the training distribution.
Classifier-free guidance — how text conditioning works
The model is trained two ways: sometimes given a text prompt, sometimes given a "null" prompt (with some random probability, e.g. 10% of the time). At inference, you run both — once conditioned on the prompt, once unconditioned — and extrapolate away from the unconditional prediction toward the conditional one. A guidance scale s dials how strongly the prompt pulls the result. This is why Stable Diffusion has a guidance_scale knob: low values give diverse, vaguely-related images; high values nail the prompt at the cost of variety.