
Key idea

Modelling P(data) instead of P(label | data). Once you've learned the data distribution, you can sample new data, score the likelihood of new data, infill missing parts, or condition on side-information for controlled generation. Different families make different trade-offs between sample quality, likelihood, speed, and ease of training.

Compare the five major families on a simple 2D distribution: same data, very different generative behaviours

A simple 2D target distribution (a ring) shown five ways. Each family is good at different things. VAE: fast, blurry. GAN: sharp samples, mode collapse risk, no likelihood. Flow: exact likelihood, restricted architecture. Diffusion: high quality, slow sampling. Autoregressive: exact likelihood, sequential sampling.

VAE (Variational Autoencoder). Encode to a Gaussian latent; decode back. Train with reconstruction + KL-to-prior. Fast sampling (one forward pass); samples are usually blurry; no exact likelihood, only a lower bound on it (the ELBO). See Variational Autoencoders.

GAN (Generative Adversarial Network). Generator and discriminator in a minimax game. Sharp samples; no likelihood; mode collapse common; finicky training. See GANs.

Normalising flows. Invertible transformations from a simple base distribution to the data distribution. Exact likelihood and exact sampling; architecture restricted to invertible maps. Useful when you need likelihoods (anomaly detection, density estimation).
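To make "exact likelihood and exact sampling" concrete, here is a minimal sketch of the change-of-variables rule for the simplest possible flow, a 1D affine map x = a·z + b over a standard-normal base. The function names and the values of a and b are illustrative assumptions, not taken from any library; real flows stack many learned invertible layers and add up their log-determinants in the same way.

```python
import math
import random

def base_log_prob(z):
    """Log-density of the standard-normal base distribution."""
    return -0.5 * (z * z + math.log(2 * math.pi))

def flow_log_prob(x, a, b):
    """Exact log p(x) for the flow x = a * z + b (requires a != 0).

    Change of variables: log p(x) = log p_base(z) - log|dx/dz|, with z = (x - b) / a.
    """
    z = (x - b) / a                       # invert the flow
    return base_log_prob(z) - math.log(abs(a))

def flow_sample(a, b, rng=random):
    """Exact sampling: draw from the base, then push through the flow."""
    return a * rng.gauss(0.0, 1.0) + b

# Hypothetical parameters: this flow represents N(b, a^2) = N(1, 4).
log_p = flow_log_prob(0.3, a=2.0, b=1.0)
```

The invertibility requirement is visible even here: the formula needs both the inverse map and the derivative dx/dz, which is why flow architectures are restricted to maps where both are cheap.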

Diffusion. Iteratively denoise from Gaussian noise. Slow sampling (10s–1000s of steps), highest sample quality, stable training. State of the art for images. See Diffusion Models.

Autoregressive. Model P(x) = ∏_i P(x_i | x_{<i}). Exact likelihood; sequential (slow) sampling; the factorisation behind modern LLMs and PixelRNN/PixelCNN.
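The chain-rule factorisation can be sketched with a toy model over two tokens. Everything below is an illustrative assumption: the probabilities are made up, and the context is cut to one previous token (a bigram model) so the tables stay small, whereas a real autoregressive model conditions on the whole prefix with a neural network.

```python
import itertools
import math
import random

# Toy first-order autoregressive model over tokens {0, 1} (made-up numbers).
P_FIRST = [0.6, 0.4]                    # P(x_1)
P_NEXT = [[0.9, 0.1], [0.3, 0.7]]       # P_NEXT[prev][cur] = P(x_i = cur | x_{i-1} = prev)

def log_likelihood(seq):
    """Exact log P(x) as a sum of per-token conditional log-probabilities."""
    total = math.log(P_FIRST[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        total += math.log(P_NEXT[prev][cur])
    return total

def sample(length, rng=random):
    """Sequential sampling: each token depends on the one before it."""
    seq = [rng.choices([0, 1], weights=P_FIRST)[0]]
    while len(seq) < length:
        seq.append(rng.choices([0, 1], weights=P_NEXT[seq[-1]])[0])
    return seq

# Sanity check: the factorised probabilities of all length-3 sequences sum to 1.
total_prob = sum(math.exp(log_likelihood(s)) for s in itertools.product([0, 1], repeat=3))
```

Scoring a sequence is a single pass over its tokens, while sampling has to loop because each draw feeds the next conditional. That asymmetry is the "exact likelihood, sequential sampling" trade-off.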

Pick by what you need

  • Sample quality: diffusion > GAN ≈ AR > flow > VAE
  • Sample speed: GAN ≈ VAE ≈ flow > AR > diffusion (AR cannot be parallelised across positions, so on long outputs it can fall behind diffusion)
  • Exact likelihood: flow = AR >> VAE (lower bound only) >> GAN (none)
  • Mode coverage: AR ≈ diffusion > flow > VAE > GAN
  • Training stability: AR > diffusion > flow ≈ VAE > GAN

None is a free lunch

  • Diffusion: slow at inference unless distilled
  • GAN: unstable, mode collapse, no likelihood
  • Flow: architecture constraints hurt quality
  • VAE: blurry; high-quality VAE needs a lot of capacity
  • AR: sequential, slow for long sequences

# Each family in one breath
import torch, torch.nn as nn, torch.nn.functional as F

# VAE — reconstruct + KL
mu, logvar = encoder(x)
z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
rec = decoder(z)
# Sum both terms over all elements so reconstruction and KL are on the same scale
loss = F.mse_loss(rec, x, reduction="sum") - 0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum()

# GAN — discriminator tries to tell real from fake
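fake = generator(torch.randn(real.size(0), z_dim))      # z_dim: latent size, set elsewhere
real_logits, fake_logits = D(real), D(fake.detach())    # detach: no generator gradients here
d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
       + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))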

# Diffusion — predict noise added at random timestep
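t = torch.randint(0, T, (x.size(0),))
noise = torch.randn_like(x)
shape = (-1,) + (1,) * (x.dim() - 1)      # reshape per-sample scalars to broadcast over x
xt = sqrt_alpha_bar[t].view(shape) * x + sqrt_one_minus[t].view(shape) * noise
loss = F.mse_loss(model(xt, t), noise)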

# Autoregressive — next-token cross-entropy
logits = model(x[:, :-1])
loss   = F.cross_entropy(logits.flatten(end_dim=-2), x[:, 1:].flatten())
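
The diffusion lines above index `sqrt_alpha_bar` and `sqrt_one_minus` without defining them. One way such tables might be built is sketched below in plain Python. The linear beta schedule (1e-4 to 0.02 over T = 1000 steps, as in the original DDPM paper) is an assumption, since the text does not say which schedule is intended; in PyTorch you would wrap the lists in `torch.tensor` before indexing them with `t`.

```python
import math

def linear_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Return (sqrt_alpha_bar, sqrt_one_minus), two lists of length T.

    alpha_bar[t] is the running product of (1 - beta_s) for s <= t, so that
    x_t = sqrt(alpha_bar[t]) * x_0 + sqrt(1 - alpha_bar[t]) * noise.
    """
    sqrt_alpha_bar, sqrt_one_minus = [], []
    alpha_bar = 1.0
    for t in range(T):
        beta = beta_start + (beta_end - beta_start) * t / (T - 1)
        alpha_bar *= 1.0 - beta
        sqrt_alpha_bar.append(math.sqrt(alpha_bar))
        sqrt_one_minus.append(math.sqrt(1.0 - alpha_bar))
    return sqrt_alpha_bar, sqrt_one_minus

sqrt_alpha_bar, sqrt_one_minus = linear_schedule()
```

Early entries keep almost all of the signal and the last entries keep almost none, which is why a random timestep exposes the network to every noise level.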

Methods in this section

Work through them as one arc — each keeps the "sample a latent code, decode it" skeleton and changes one thing:

  • Probabilistic PCA (PPCA) — the linear-Gaussian blueprint. A straight-line decoder; everything in closed form.
  • Variational Autoencoders — swap the line for a neural net; approximate the posterior with an encoder; train the ELBO.
  • Generative Adversarial Networks — drop the likelihood entirely; train the decoder to fool a critic. Sharp samples, finicky training.
  • Diffusion Models — make the decoder a long denoising chain. Best quality, slow sampling, the current state of the art.

Score matching

$$ \mathcal{L} = \mathbb{E}_{x \sim p_{\text{data}}}\big\lVert \nabla_x \log p_\theta(x) - \nabla_x \log p_{\text{data}}(x) \big\rVert^2 $$

  • Match the gradient of log-density, not the density itself
  • Avoids the partition function
  • The foundation of score-based and diffusion models

$$ \text{loss} \;=\; \text{average squared difference between model's score and data's score, over real samples} $$

In words. Rather than fitting the data's density directly (which would require a hard-to-compute normaliser), fit the score: the gradient of the log-density with respect to x, written ∇x log p(x). The score is a vector field pointing "uphill" in probability — toward where data is more likely. For each real training sample, compute the model's predicted score and compare it to the true data score, then average the squared error. The trick is that normalising constants drop out when you take the log-gradient, so you can train without the partition function. In practice you don't know the true data score either, but Hyvärinen's identity lets you optimise an equivalent objective, and denoising score matching estimates it by adding noise.

  • score, ∇x log p(x): the gradient of log-density, a vector pointing toward higher-density regions
  • model's score: the score predicted by your neural net (parameters θ)
  • data's score: the true gradient of log p_data(x), not known directly but estimable
  • average over real samples: expectation taken over data x drawn from p_data
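
The claim that normalising constants drop out of the score can be checked numerically. The sketch below uses a 1D Gaussian with made-up parameters and a finite-difference gradient; both choices are assumptions made to keep the example dependency-free, and a real model would get the gradient from automatic differentiation.

```python
import math

def log_unnormalised(x, mu=1.0, sigma=0.5):
    """log of exp(-(x - mu)^2 / (2 sigma^2)), with the normaliser left out."""
    return -((x - mu) ** 2) / (2 * sigma ** 2)

def log_normalised(x, mu=1.0, sigma=0.5):
    """The same density with its normalising constant included."""
    return log_unnormalised(x, mu, sigma) - math.log(sigma * math.sqrt(2 * math.pi))

def score(log_density, x, h=1e-5):
    """Central finite-difference estimate of d/dx log p(x)."""
    return (log_density(x + h) - log_density(x - h)) / (2 * h)

x = 0.2
s_unnorm = score(log_unnormalised, x)
s_norm = score(log_normalised, x)
s_exact = -(x - 1.0) / 0.5 ** 2      # analytic Gaussian score: -(x - mu) / sigma^2
```

All three values agree up to floating-point error: the constant shifts log p(x) by the same amount everywhere, so it vanishes under the gradient.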

Energy-based models. Define p(x) ∝ exp(-Eθ(x)). Maximum likelihood requires the partition function (intractable). Trained via contrastive divergence, score matching, or variational methods. Conceptually powerful; tricky in practice.

Score-based models. Learn ∇x log p(x) (the "score"). Sample via Langevin dynamics or related SDEs. Closely related to diffusion: a denoising network is essentially a score estimator at multiple noise levels. Unified by Song et al. 2021's SDE formulation.
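
A minimal sketch of Langevin sampling, assuming the score is known exactly. Here it is the analytic score of a 1D Gaussian with made-up parameters, standing in for a trained score network; the step size and step count are also illustrative choices.

```python
import math
import random

def gaussian_score(x, mu, sigma):
    """Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2

def langevin_sample(score_fn, x0, step=0.01, n_steps=500, rng=random):
    """Unadjusted Langevin dynamics: x <- x + (step / 2) * score(x) + sqrt(step) * z."""
    x = x0
    for _ in range(n_steps):
        x = x + 0.5 * step * score_fn(x) + math.sqrt(step) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
mu, sigma = 2.0, 0.5
samples = [
    langevin_sample(lambda x: gaussian_score(x, mu, sigma), rng.gauss(0.0, 1.0), rng=rng)
    for _ in range(1000)
]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
```

The chains start from N(0, 1), far from the target, and end up distributed close to N(2, 0.25) using nothing but the score. The finite step size leaves a small bias in the spread, which shrinks as the step does.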

Conditional generation. p(x | y) instead of p(x). Class-conditional images, text-to-image, image-to-image. Classifier-free guidance (Ho & Salimans 2021) trains a single model on conditional and unconditional examples; trade off quality and diversity at sampling time by mixing the two.

Latent diffusion. Train a VAE to compress images to a latent space; train a diffusion model in that space. Stable Diffusion is exactly this. Faster sampling (smaller latents) without losing much quality.
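
The arithmetic behind "smaller latents" is short. The shapes below are the commonly cited Stable Diffusion v1 configuration (a 512×512 RGB image encoded to a 64×64 latent with 4 channels); treat them as an illustrative assumption rather than a property of every latent diffusion model.

```python
def num_elements(height, width, channels):
    """Number of values the denoiser has to process per sample."""
    return height * width * channels

pixel_elems = num_elements(512, 512, 3)               # RGB image
latent_elems = num_elements(512 // 8, 512 // 8, 4)    # 8x downsampled, 4 latent channels
compression = pixel_elems / latent_elems              # 48.0
```

Every denoising step therefore touches about 48 times fewer values than it would in pixel space, and the VAE decoder runs only once at the end.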

Likelihood vs sample quality. Not the same! A model can have great likelihood and ugly samples (over-smooth average) or beautiful samples and terrible likelihood (mode-collapsed). Pick the metric that matches what you care about.

Evaluation. No single metric is right. FID (Fréchet Inception Distance) compares feature statistics of real and generated images; IS (Inception Score) rewards samples that are confidently classifiable and spread across classes; PR (precision and recall, computed in a feature space) separates sample quality from mode coverage; CLIP score measures text-image alignment. All have known failure modes.
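
What "compares feature statistics" means for FID can be sketched as the Fréchet distance between two Gaussians fitted to feature vectors. The version below is a simplification under stated assumptions: it treats the covariances as diagonal and uses the population variance, whereas real FID uses full covariance matrices of Inception features; the function names are hypothetical.

```python
import math

def feature_stats(features):
    """Per-dimension mean and variance of a list of feature vectors."""
    n, dim = len(features), len(features[0])
    mu = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mu[d]) ** 2 for f in features) / n for d in range(dim)]
    return mu, var

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared Fréchet distance between Gaussians with diagonal covariances.

    Full form: |mu1 - mu2|^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)).
    For diagonal S1, S2 the trace term reduces to sum_i (sqrt(v1_i) - sqrt(v2_i))^2.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var1, var2))
    return mean_term + cov_term

# Two toy 1D "feature distributions": N(0, 1) versus N(3, 4).
d = frechet_distance_diag([0.0], [1.0], [3.0], [4.0])   # (0 - 3)^2 + (1 - 2)^2
```

Because only means and covariances enter, two sets of images with matching first and second moments score a distance of zero however different they look, which is one of the known failure modes.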

import torch, torch.nn as nn

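# Classifier-free guidance at sampling time. The model is trained with random
# dropout of the condition, so model(x, t, None) gives the unconditional prediction.
# shape and denoise_step (one sampler update) are supplied by the caller.
def sample_cfg(model, y, num_steps, shape, denoise_step, guidance=7.5):
    x = torch.randn(shape)                      # start from pure Gaussian noise
    for t in reversed(range(num_steps)):
        noise_cond   = model(x, t, y)
        noise_uncond = model(x, t, None)
        # guidance = 0: unconditional; 1: conditional; > 1: push past the conditional
        noise = noise_uncond + guidance * (noise_cond - noise_uncond)
        x = denoise_step(x, noise, t)
    return x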
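
`sample_cfg` covers the sampling side only. The training side, randomly dropping the condition, might look like the sketch below. The helper names and the 10% drop rate are assumptions for illustration; the second function repeats the mixing rule from `sample_cfg` on plain floats so its behaviour at a few guidance values is easy to check.

```python
import random

def drop_condition(labels, p_drop=0.1, rng=random):
    """Replace each label with None with probability p_drop.

    Training on the result lets one network learn both p(x | y) (label kept)
    and p(x) (label dropped), which guidance then mixes at sampling time.
    """
    return [None if rng.random() < p_drop else y for y in labels]

def guided(noise_uncond, noise_cond, guidance):
    """The guidance mixing rule applied to scalar predictions."""
    return noise_uncond + guidance * (noise_cond - noise_uncond)

batch_labels = drop_condition([3, 7, 7, 1], p_drop=0.1, rng=random.Random(0))
```

With predictions 1.0 (unconditional) and 3.0 (conditional), a guidance of 0 returns 1.0, a guidance of 1 returns 3.0, and a guidance of 2 returns 5.0, moving past the conditional prediction and trading diversity for adherence to the condition.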

Score SDE

$$ dx = \big[\, f(x, t) - g(t)^2 \nabla_x \log p_t(x)\, \big]\, dt + g(t)\, d\bar W $$

  • Reverse-time SDE for sampling; needs the score ∇x log p_t(x)
  • Unifies diffusion, score-based, and Langevin samplers
  • Score networks trained at multiple noise levels approximate ∇x log p_t(x)

$$ dx \;=\; \big[\, \text{drift}(x, t) \;-\; \text{diffusion}(t)^2 \times \text{score}(x, t)\, \big]\, dt \;+\; \text{diffusion}(t)\, d\bar W $$

In words. This is the reverse-time stochastic differential equation that turns noise back into data. dx is the tiny change in x at each step of integration. The bracketed part is the deterministic pull: a built-in drift term f(x, t), minus the squared diffusion coefficient g(t) times the score — the gradient of log-density at the current noise level, which points uphill toward more-likely x. The final term g(t)·d̅W injects a controlled amount of random noise at every step (d̅W is the reverse Wiener-process increment — formal mathematical jargon for "infinitesimal Gaussian noise"). Start from pure Gaussian noise, integrate this equation backwards in time, and you get a sample from the data distribution. Diffusion, score-based models, and Langevin sampling are all instances of this with different choices of f and g.

  • drift, f(x, t): the deterministic pull, often a simple linear function of x
  • diffusion, g(t): how much noise to inject at time t
  • score, ∇x log p_t(x): gradient of log-density at the current noise level, approximated by a neural net
  • d̅W: reverse-time noise increment, the mathematical name for "a small Gaussian random kick"
  • t: time, running from noise (t=T) back to data (t=0) during sampling
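
The reverse-time SDE can be integrated numerically with Euler-Maruyama steps. The sketch below makes several simplifying assumptions so that it runs without a trained network: 1D data distributed as N(m, s²), a forward process with f = 0 and g = 1 (pure Brownian noising), and therefore p_t = N(m, s² + t) with an exactly known score. The function name and all numeric values are hypothetical.

```python
import math
import random

def reverse_sde_sample(m, s, T=25.0, n_steps=1000, rng=random):
    """Euler-Maruyama integration of the reverse-time SDE from t = T down to t = 0."""
    dt = T / n_steps
    x = rng.gauss(0.0, math.sqrt(T))          # prior sample; T >> s^2, so close to p_T
    for k in range(n_steps):
        t = T - k * dt                        # current time, running from T towards 0
        score = -(x - m) / (s ** 2 + t)       # exact score of p_t (a network in practice)
        drift = 0.0 - 1.0 ** 2 * score        # f - g^2 * score, with f = 0 and g = 1
        # Stepping backwards in time means dt is negative, hence "- drift * dt".
        x = x - drift * dt + 1.0 * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
samples = [reverse_sde_sample(2.0, 0.5, rng=rng) for _ in range(1000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((x - mean) ** 2 for x in samples) / len(samples))
```

Starting from wide Gaussian noise, the samples end up close to the data distribution N(2, 0.25). Swapping in a different f and g, or a learned score, changes the sampler but not the structure of the loop.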

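Consistency models. Song et al. (2023). Learn a network that maps any point on a diffusion trajectory straight to its clean endpoint, either by distilling a pretrained diffusion model or by training from scratch. Sampling takes 1 to 4 steps instead of 50–1000. Trade-off: somewhat lower quality than a multi-step diffusion sampler, but orders of magnitude faster.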

Flow matching & rectified flow. Lipman et al. (2023), Liu et al. (2022). Train a vector field that maps noise to data; sample by solving the ODE. Often faster than diffusion at comparable quality. The frontier of "diffusion done better".
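
"Sample by solving the ODE" can be sketched with plain Euler steps. To keep the example checkable without training anything, it assumes a 1D problem where noise is N(0, 1), data is N(m, s²), and the two endpoints of each straight-line path are drawn independently; in that case the ideal velocity field has a closed form, which stands in here for a trained network. Names and parameter values are hypothetical.

```python
import math

def velocity(x, t, m, s):
    """Ideal velocity field carrying N(0, 1) at t = 0 to N(m, s^2) at t = 1.

    For the path x_t = (1 - t) * x0 + t * x1 with independent Gaussian endpoints,
    this is E[x1 - x0 | x_t = x], the target a flow-matching model regresses to.
    """
    var_t = (1 - t) ** 2 + (t * s) ** 2
    return m + (t * s ** 2 - (1 - t)) / var_t * (x - t * m)

def sample_ode(x0, m, s, n_steps=1000):
    """Euler integration of dx/dt = velocity(x, t) from t = 0 to t = 1."""
    x, dt = x0, 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt, m, s)
    return x

x1 = sample_ode(1.0, m=2.0, s=0.5)   # the noise point z = 1 should land near m + s * z
```

Sampling is deterministic given the starting noise: each noise value is carried along a smooth trajectory to one data value, and fewer Euler steps trade accuracy for speed.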

Discrete diffusion & masked language models. The diffusion framework generalises beyond Gaussian noise — for discrete data (text, tokens), use absorbing-state diffusion or masked modelling. Mask-and-predict objectives are the conceptual cousin.
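
A minimal sketch of the absorbing-state forward process for tokens. It assumes a linear schedule in which each token is masked independently with probability t; the `MASK` string and function names are hypothetical placeholders, not any tokenizer's API.

```python
import random

MASK = "[MASK]"

def forward_mask(tokens, t, rng=random):
    """Absorbing-state noising: at time t in [0, 1], each token has independently
    been replaced by MASK with probability t."""
    return [MASK if rng.random() < t else tok for tok in tokens]

def masked_positions(noisy):
    """Positions the denoiser is trained to fill in."""
    return [i for i, tok in enumerate(noisy) if tok == MASK]

tokens = ["the", "cat", "sat", "down"]
noisy = forward_mask(tokens, t=0.5, rng=random.Random(0))
targets = masked_positions(noisy)
```

At t = 0 the sequence is untouched and at t = 1 every token is masked, mirroring the clean-to-noise path of Gaussian diffusion. A single fixed masking rate gives the usual masked-language-model objective.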

Autoregressive scaling. Most modern foundation models are autoregressive over discrete tokens, and the same recipe extends to images (Parti, ImageGPT): tokenise the image with a VQ-VAE or similar, then run an autoregressive transformer over the tokens. MaskGIT uses the same kind of tokens but predicts masked tokens in parallel rather than autoregressively. Token-by-token decoding is slower at inference than parallel diffusion but conceptually simpler.

Likelihood-based vs adversarial. Likelihood-based models (AR, diffusion, flows, VAEs) tend to cover all modes; the cost, most visible in VAEs and flows, is blurry or over-smoothed samples. Adversarial models (GANs) produce sharp samples but miss modes. Hybrids (VAE-GAN, diffusion-GAN) try to combine the two.

Controllable generation. ControlNet, T2I-Adapter, LoRA, IP-Adapter, image conditioning, depth-conditioning, prompt-engineering — the modern stack for steering diffusion models. Less a "model family" than a meta-pattern of adding more conditioning signals.

Evaluation difficulties. FID is the de facto standard but can disagree with human judgement. CLIP-IQA, DINOv2-FID, and human evaluation (preference models) are alternative measures. Always look at samples, not just numbers.

import torch
import torch.nn.functional as F

# Consistency-model-style one-step generation
def one_step_sample(consistency_model, noise):
    return consistency_model(noise, t=1.0)   # one forward pass

# Flow matching — train a velocity field
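def flow_loss(model, x_data, sigma=1.0):
    t = torch.rand(x_data.size(0), device=x_data.device)
    t_b = t.view(-1, *([1] * (x_data.dim() - 1)))    # broadcast t over non-batch dims
    x0 = sigma * torch.randn_like(x_data)             # noise endpoint (t = 0)
    x_t = (1 - t_b) * x0 + t_b * x_data               # straight-line interpolation
    target_velocity = x_data - x0                     # d x_t / dt along that line
    return F.mse_loss(model(x_t, t), target_velocity)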