Key idea

Take the PPCA recipe and bend the line. Probabilistic PCA generates data by drawing a code z from a unit Gaussian and pushing it through a straight line Wz + μ, plus noise. A VAE keeps that exact skeleton — sample a code, decode it — but replaces the straight line with a neural network so the data manifold can curve. The price: the tidy closed-form posterior PPCA enjoyed is now intractable, so a second network — the encoder — learns to approximate it. Train both by maximising the ELBO, a lower bound on the same log p(x) PPCA computes exactly.

Task. Compress the 12×12 image on the left through a narrow bottleneck of just two numbers (the orange dot's position in the middle), then reconstruct the image on the right. Click a preset to encode it; drag the orange dot to roam the latent space; toggle VAE mode for stochastic codes. The reconstruction error tells you what the code couldn't keep.
Figure (interactive): Encoder · 2-D latent · decoder. Drag the dot to decode any code.

Toggle VAE mode on and watch the single latent point become a fuzzy cloud of codes — the encoder now emits a distribution, and the KL term pulls every cloud toward the shared unit Gaussian so the gaps between codes decode to something sensible. That's what makes the space samplable.

PPCA was the "hydrogen atom": linear, Gaussian, solvable by hand. But a straight line can only ever produce one ellipsoidal blob. Faces, digits, audio live on curved manifolds. The VAE is the smallest possible upgrade that lets the manifold bend — and it earns its name from how it copes with the fallout.

The three pieces. An encoder q(z|x) reads a datapoint and outputs a distribution over codes — a mean and a spread — rather than a single point. A latent code z is sampled from that distribution. A decoder p(x|z) — now a neural net, not a matrix — maps the code back up to data space. Encoder and decoder are trained jointly so that codes round-trip back to their inputs.

Why a distribution, not a point? This is the whole trick. A plain autoencoder squeezes each input to a single point, and the gaps between those points decode to garbage — there's no way to sample. By forcing the encoder to emit a fuzzy cloud and pulling every cloud toward one shared unit-Gaussian prior, the VAE fills the latent space in. Afterwards you can draw any z ~ N(0, I), decode it, and get something plausible. That's exactly the PPCA generative story, now with a curved decoder.

The reparameterisation trick, in words. You can't backpropagate through "draw a random sample". So instead of sampling z directly, you sample a fixed bit of noise ε from a standard Gaussian and build the code as z = mean + spread × ε. The randomness now lives in ε, which has no parameters, while gradients flow cleanly through the mean and spread the encoder produced. Same samples, differentiable path.

Fast but blurry. Generation is a single forward pass through the decoder, far faster than diffusion's long denoising chain, and training needs none of a GAN's adversarial dance. The catch is softness: the Gaussian likelihood rewards a blurry average over plausible outputs rather than committing to one crisp one. VAEs trade sharpness for speed, a stable likelihood objective, and a navigable latent space.

Reach for a VAE when

  • You want PPCA's probabilistic latent-variable story but the data manifold is curved
  • A fast, single-pass generator matters and you can tolerate some blur
  • You need a smooth, navigable latent space — interpolation, attribute arithmetic
  • You want a learned compressor to feed a downstream model (latent diffusion, autoregressive tokens)

It breaks down when

  • Razor-sharp samples are the goal — the Gaussian likelihood blurs (use a GAN or diffusion)
  • The data really is Gaussian — then plain PPCA is exact and far cheaper
  • You need exact likelihoods — the ELBO is only a lower bound
  • A strong decoder ignores the latent entirely (posterior collapse — see In-depth)

import torch, torch.nn as nn

# A VAE is PPCA with a *neural* decoder and a *learned* encoder.
# Sample a code from N(0, I), decode it -> a new datapoint (one forward pass).
@torch.no_grad()
def sample_vae(decoder, n, d_latent):
    z = torch.randn(n, d_latent)   # 1. draw codes from the unit-Gaussian prior
    return decoder(z)              # 2. decode with a NEURAL NET (not a matrix)

# Compare to PPCA: there the decoder was just `z @ W.T + mu`.
# Make `decoder` linear and a VAE collapses straight back to PPCA.

The training objective (ELBO)

$$ \mathcal{L}(\theta, \phi; x) = \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\!\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction}} \;-\; \underbrace{D_{\mathrm{KL}}\!\big(q_\phi(z \mid x) \,\big\|\, p(z)\big)}_{\text{regulariser}} $$

  • qφ(z|x) — the encoder, outputting a Gaussian N(μφ(x), σφ²(x)) over codes
  • pθ(x|z) — the decoder network; the likelihood of reconstructing x from z
  • p(z) = N(0, I) — the same unit-Gaussian prior PPCA used
  • Maximise ℒ: it is a lower bound on log p(x) (derived in the In-depth tier)

$$ \text{ELBO} = \underbrace{\text{how well codes reconstruct the input}}_{\text{reconstruction}} \;-\; \underbrace{\text{how far the encoder drifts from the prior}}_{\text{regulariser}} $$

In words. The objective pits two pressures against each other. The reconstruction term rewards round-tripping: sample a code from the encoder's cloud given x, decode it, and check how close the result is to the original x. Left alone, that pressure would scatter codes anywhere convenient — leaving holes you can't sample from. So the regulariser, a KL divergence, penalises every per-input cloud for drifting away from the shared prior (a plain unit Gaussian). The balance packs all the codes into one smooth Gaussian-shaped region, so that after training you can draw a fresh code from the prior and the decoder knows what to do with it. Maximising the whole expression — the ELBO, or Evidence Lower BOund — provably pushes up the model's true log-likelihood.

  • reconstruction: expected log-likelihood of decoding the input back from its own code
  • regulariser: KL distance from the encoder's per-input cloud to the unit-Gaussian prior
  • code: the latent z, sampled from the encoder's distribution, with fewer dims than the data
  • prior: N(0, I), the round cloud you sample from at generation time

What q(z|x) actually is — and why it's the thing we plug in. The encoder is a network that reads x and outputs just two vectors: a mean μ and a spread σ (usually emitted as log σ², to keep it positive). Those two vectors are q(z|x) — the Gaussian N(μ, σ²) over codes for this input. There's nothing else to it; the whole "cloud" is just those numbers, and μ = μφ(x), σ = σφ(x) change from input to input. Both halves of the ELBO are then computed straight from μ and σ:

  • the KL term is a closed-form function of μ and σ alone — it's what pulls μ → 0 and σ → 1;
  • the reconstruction term draws a code z = μ + σ ⊙ ε from them (the reparameterisation trick), decodes it, and scores log p(x|z).

So training is nothing more than nudging the encoder's μ, σ outputs (and the decoder's weights) to push the ELBO up. That is the sense in which q(z|x) is what we "measure" for each input and feed into the loss.

Reading the objective, term by term. Two terms, pulling in opposite directions.

The reconstruction term, Eq(z|x)[ log p(x|z) ] — why an average? The encoder doesn't hand you a single code for x; it hands you a whole cloud of plausible codes q(z|x). So "how well does x come back?" can't be read off one code — you average over the cloud: draw a code z from the encoder, decode it, and ask how probable the original x is under the decoder (that's log p(x|z)), then average that over all the codes the encoder finds plausible. Taking the expectation is what ties the loss to how the model actually generates — sample a code, decode it — rather than to one lucky code. (In practice a single code drawn per step is enough to estimate the average; making that draw differentiable is exactly what the reparameterisation trick below is for.) Scoring only the mean code would ignore that the encoder is deliberately fuzzy.
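
A minimal sketch of that averaging, under stated assumptions: the toy linear decoder, the unit-variance Gaussian likelihood, and the name `recon_term` are all illustrative and not from the text. It estimates the expectation with one code per input and again with many, so the two can be compared.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

def recon_term(decoder, x, mu, logvar, n_samples=1):
    """Monte Carlo estimate of E_q[log p(x|z)], one value per input.

    Assumes a unit-variance Gaussian likelihood, so
    log p(x|z) = -0.5 * ||x - decoder(z)||^2 - (D/2) * log(2*pi).
    """
    std = torch.exp(0.5 * logvar)
    d = x.shape[-1]
    total = torch.zeros(x.shape[0])
    for _ in range(n_samples):
        z = mu + std * torch.randn_like(std)            # one code per input
        sq_err = (x - decoder(z)).pow(2).sum(dim=-1)    # squared error per input
        total = total + (-0.5 * sq_err - 0.5 * d * math.log(2 * math.pi))
    return total / n_samples                            # shape: (batch,)

# Hypothetical toy setup: 2-D codes, 5-D data, linear decoder.
decoder = nn.Linear(2, 5)
x = torch.randn(4, 5)
mu, logvar = torch.zeros(4, 2), torch.zeros(4, 2)

with torch.no_grad():
    one_sample = recon_term(decoder, x, mu, logvar, n_samples=1)
    many_samples = recon_term(decoder, x, mu, logvar, n_samples=2000)
```

The one-sample value is a noisy but unbiased estimate of the many-sample average; training relies on that noise washing out over many gradient steps.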

The regulariser, KL( q(z|x) ‖ p(z) ) — what it measures and why it's there. This is the distance from the encoder's cloud for x to the shared prior N(0, I): it grows as the cloud's mean drifts from 0 or its width drifts from 1, and is zero only when the cloud is the prior. Why penalise that at all? Because the reconstruction term, left to itself, would shove each input's cloud into its own private corner and shrink it to a spike — superb for reconstruction, useless for generation, since the prior you sample from at generation time would then sit over empty space. The KL is the counter-pressure that keeps every cloud piled near the one shared prior, so the latent space stays gap-free and samplable. The latent-space viz further down makes that tug-of-war visible.

Aside: how do you compare two distributions? The KL divergence. Both the regulariser and the bound's "gap" are KL divergences, so it's worth a line on what that is. KL(q ‖ p) measures how different two distributions are — a directed "distance" from q to p. Concretely: draw samples from q, and at each one look at the log-ratio of how likely q thinks it is versus p, then average —

$$ D_{\mathrm{KL}}(q \,\|\, p) = \mathbb{E}_{z \sim q}\!\left[\, \log \frac{q(z)}{p(z)} \,\right]. $$

If q and p agree everywhere the ratio is 1, its log is 0, and the KL is zero. The more they disagree — q placing mass where p has little — the larger it grows. Two facts do all the work for us:

  • It is never negative, and it is zero only when the two distributions are identical. That single property is what makes the ELBO a genuine lower bound: the gap is a KL, so it can never dip below zero.
  • It is asymmetric: KL(q ‖ p) ≠ KL(p ‖ q), which is why we call it a divergence, not a true distance.

For two Gaussians, here the encoder's cloud and the prior, the whole expectation collapses to a short closed-form formula in their means and variances (no sampling needed), which is why the regulariser is so cheap to compute. And you can feel it in the bound viz below: as you drag q onto the posterior and their two curves come to coincide, the KL gap collapses to zero.
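
To make the definition concrete, here is a small sketch using `torch.distributions`. The particular means and widths are arbitrary choices for illustration. It estimates KL(q ‖ p) by sampling from q, compares that with the closed form, and swaps the arguments to show the asymmetry.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)

q = Normal(loc=1.0, scale=0.5)   # stand-in for one encoder cloud (illustrative values)
p = Normal(loc=0.0, scale=1.0)   # the unit-Gaussian prior

# Monte Carlo: draw samples from q, average the log-ratio log q(z) - log p(z).
z = q.sample((200_000,))
kl_mc = (q.log_prob(z) - p.log_prob(z)).mean()

# Closed form for two Gaussians: no sampling needed.
kl_qp = kl_divergence(q, p)
kl_pq = kl_divergence(p, q)      # arguments swapped: a different number

print(f"Monte Carlo KL(q||p) = {kl_mc.item():.4f}")
print(f"closed-form KL(q||p) = {kl_qp.item():.4f}")
print(f"closed-form KL(p||q) = {kl_pq.item():.4f}")
```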

Why is it a lower bound — and where does it come from? Training by maximum likelihood means pushing up log p(x), the log-probability the model assigns to the data. But p(x) averages the decoder over every possible code — p(x) = ∫ p(x|z)·p(z) dz — and for a neural decoder that integral is hopeless: draw codes from the prior and almost all of them decode to nothing like x, so the average is impossible to estimate. We can't maximise what we can't compute.

The move. Bring in the encoder's guess q(z|x) about which codes could have produced x. For any choice of q, one line of algebra splits the evidence exactly in two:

$$ \log p(x) = \underbrace{\mathcal{L}(q)}_{\text{ELBO — a floor we can compute}} \;+\; \underbrace{D_{\mathrm{KL}}\!\big(q(z\mid x)\,\|\,p(z\mid x)\big)}_{\text{gap} \,\ge\, 0}. $$

The second term is a KL divergence, so it can never be negative. That forces the ELBO to sit at or below log p(x): a floor under the evidence. That is the entire meaning of "Evidence Lower BOund".

What the gap actually is. The distance from the floor up to the true evidence is exactly how wrong the encoder's guess q is about the true posterior p(z|x) — the ideal distribution over codes for this input. Sharpen q and the floor rises; when q matches the true posterior the gap hits zero and the bound touches log p(x).

Why that's a good deal. We still can't see log p(x) — but we can compute the ELBO, because rearranged it is exactly the reconstruction − KL-to-prior objective above (both terms tractable). So pushing the ELBO up does two good things at once: it lifts our floor toward the real evidence, and it drags q toward the true posterior. Maximise the thing you can compute, and the thing you actually want comes along for free.
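
The split can be checked by hand in the one case where nothing is intractable. The sketch below assumes a hypothetical scalar linear-Gaussian model, p(z) = N(0, 1) and p(x|z) = N(w·z, s2), with arbitrary values for w, s2 and x. For such a model the evidence and the true posterior are both known in closed form, so ELBO + gap can be compared against log p(x) for any Gaussian guess q = N(m, v).

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for scalar means and variances."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

# Hypothetical toy model: p(z) = N(0, 1), p(x|z) = N(w*z, s2).
w, s2, x = 2.0, 0.5, 1.3

# Exact evidence and exact posterior (available only because the model is linear).
log_px = -0.5 * math.log(2 * math.pi * (w * w + s2)) - x * x / (2 * (w * w + s2))
post_mean = w * x / (w * w + s2)
post_var = s2 / (w * w + s2)

def elbo(m, v):
    """Reconstruction term minus KL-to-prior for q = N(m, v)."""
    recon = -0.5 * math.log(2 * math.pi * s2) - ((x - w * m) ** 2 + w * w * v) / (2 * s2)
    return recon - kl_gauss(m, v, 0.0, 1.0)

def gap(m, v):
    """KL from q = N(m, v) to the true posterior."""
    return kl_gauss(m, v, post_mean, post_var)

for m, v in [(0.0, 1.0), (0.3, 0.4), (post_mean, post_var)]:
    print(f"q=N({m:.3f}, {v:.3f})  ELBO={elbo(m, v):.4f}  gap={gap(m, v):.4f}  "
          f"ELBO+gap={elbo(m, v) + gap(m, v):.4f}  log p(x)={log_px:.4f}")
```

The last row uses the true posterior as q, so its gap is zero and the floor touches the evidence.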

Figure (interactive): The ELBO is a floor under the evidence. Improve the encoder's guess and it rises to meet log p(x).

Drag the orange encoder q toward the indigo true posterior (or hit Match) and watch the KL gap close and the ELBO floor rise to meet log p(x). Widen or narrow q with the slider — matching the posterior's width matters just as much as its centre. In a real VAE you never get to see the posterior or the evidence; you only ever push up the floor, trusting that it drags everything else with it.

What's actually out of reach. The encoder, the decoder, the reconstruction score and the KL are all ordinary computations — a couple of forward passes and a closed-form formula. Only two quantities are intractable: the evidence p(x) and the true posterior p(z|x), because both would need the decoder integrated over every possible code. That's the whole reason for the ELBO — it's assembled from the tractable pieces yet bounds the intractable evidence, and the encoder qφ is a trained stand-in for the intractable posterior (the gap in the picture above is how good a stand-in it is).

The reparameterisation trick. The encoder emits a mean μ and a log-variance log σ², from which the spread σ = exp(½ log σ²) follows. You need a sample z ~ N(μ, σ²), but sampling isn't differentiable. The fix is to factor the randomness out: draw ε ~ N(0, I) and set

$$ z = \mu + \sigma \odot \varepsilon. $$

Now z is a deterministic, differentiable function of μ and σ; the only stochastic input ε carries no parameters, so gradients flow back into the encoder. Without this trick the VAE simply couldn't be trained by backprop.
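
A small check of that claim, as a sketch: `mu` and `logvar` below are hand-written stand-ins for encoder outputs, and the squared-norm loss is an arbitrary differentiable function chosen for illustration. The last lines contrast `sample()` and `rsample()` from `torch.distributions`, which draw the same kind of sample with and without a gradient path.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

# Hypothetical encoder outputs for one input, marked as trainable.
mu = torch.tensor([0.5, -1.0], requires_grad=True)
logvar = torch.tensor([0.0, -2.0], requires_grad=True)

eps = torch.randn(2)                      # parameter-free noise
sigma = torch.exp(0.5 * logvar)
z = mu + sigma * eps                      # z = mu + sigma * eps

loss = z.pow(2).sum()                     # any differentiable function of z
loss.backward()                           # gradients reach mu and logvar

print("d loss / d mu     =", mu.grad)
print("d loss / d logvar =", logvar.grad)

# Same distribution, two ways of drawing from it.
dist = Normal(mu, sigma)
blocked = dist.sample()                   # no gradient path back to mu, sigma
pathwise = dist.rsample()                 # reparameterised draw, differentiable
```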

The KL term is closed-form. Because both qφ(z|x) and the prior are Gaussian, the regulariser has an exact formula — no sampling needed: $D_{\mathrm{KL}} = -\tfrac12 \sum_j \big(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\big)$. It is minimised when μ → 0 and σ → 1, i.e. when each code-cloud matches the standard normal. That is precisely the pull that makes the latent space samplable.
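
Where the formula comes from, as a short worked derivation for a single latent dimension (the sum over j follows because both Gaussians factorise across dimensions). With q = N(μ, σ²) and p = N(0, 1), the ½ log 2π terms cancel in the log-ratio:

$$ D_{\mathrm{KL}}(q\,\|\,p) = \mathbb{E}_{z\sim q}\!\left[-\tfrac12\log\sigma^2 - \frac{(z-\mu)^2}{2\sigma^2} + \frac{z^2}{2}\right] = -\tfrac12\log\sigma^2 - \tfrac12 + \tfrac12\big(\mu^2 + \sigma^2\big) = -\tfrac12\big(1 + \log\sigma^2 - \mu^2 - \sigma^2\big), $$

using E_q[(z − μ)²] = σ² and E_q[z²] = μ² + σ². The derivatives are ∂/∂μ = μ and ∂/∂σ² = ½(1 − 1/σ²); both vanish at μ = 0 and σ² = 1, which is the minimum quoted above.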

Why this gives PPCA back. PPCA's posterior p(z|x) was Gaussian and known in closed form because the decoder was linear. The VAE's qφ is the amortised stand-in for that posterior — one network that infers codes for any input. Make the encoder and decoder linear with a Gaussian output and the ELBO becomes tight: the VAE reduces exactly to PPCA. Everything new is the cost of letting the decoder curve.

Pick the reconstruction loss by data type. MSE for continuous pixels or audio (a Gaussian output noise model — the direct heir of PPCA's σ²I); per-pixel binary cross-entropy for [0,1] Bernoulli outputs (the original paper's MNIST choice); ordinary cross-entropy for discrete tokens.
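
One way to wire that choice into code, as a hedged sketch: the function name `reconstruction_loss` and the `kind` strings are made up for illustration, and the Bernoulli and categorical branches assume the decoder emits logits rather than probabilities.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(decoder_out, target, kind):
    """Negative log-likelihood of the target, summed over the batch.

    kind = "gaussian":    decoder_out is the predicted mean; MSE matches the
                          Gaussian NLL only up to scale and additive constants.
    kind = "bernoulli":   decoder_out holds logits, target lies in [0, 1].
    kind = "categorical": decoder_out holds logits of shape (N, classes),
                          target holds integer class ids of shape (N,).
    """
    if kind == "gaussian":
        return F.mse_loss(decoder_out, target, reduction="sum")
    if kind == "bernoulli":
        return F.binary_cross_entropy_with_logits(decoder_out, target, reduction="sum")
    if kind == "categorical":
        return F.cross_entropy(decoder_out, target, reduction="sum")
    raise ValueError(f"unknown likelihood: {kind}")

# Toy tensors standing in for decoder outputs and data.
torch.manual_seed(0)
pixels = torch.rand(8, 784)                 # values in [0, 1]
pixel_out = torch.randn(8, 784)             # means or logits, depending on kind
tokens = torch.randint(0, 10, (8,))         # integer class ids
token_logits = torch.randn(8, 10)

loss_gauss = reconstruction_loss(pixel_out, pixels, "gaussian")
loss_bern = reconstruction_loss(pixel_out, pixels, "bernoulli")
loss_cat = reconstruction_loss(token_logits, tokens, "categorical")
```

Using `reduction="sum"` keeps the reconstruction term on the same scale as a summed KL term; switching to a mean would silently change their relative weight.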

β-VAE and disentanglement. Scale the KL term by a factor β: ℒ = E[log p(x|z)] − β·KL. With β > 1 the prior pressure rises and each latent axis is nudged to carry one independent factor of variation (pose, lighting, identity) — "disentanglement". It's an inductive bias, not a guarantee. The Autoencoders page covers the regularised variants (sparse, denoising, contractive) in depth.

Seeing the KL pull — why the codes must stay close together. The formula says the KL term is smallest when a cloud has μ → 0 and σ → 1 — i.e. when it sits right on top of the shared prior. Do that for every input and all the code-clouds pile into one region with no gaps between them. That's the point: you generate by drawing a code from the prior and decoding it, so a code only produces something real if it lands where the encoder actually placed training codes. Drag the pressure below.

Figure (interactive): Why every code-cloud must hug the prior, or you can't sample. A slider sets the pressure β.

Watch the two meters fight: reconstruction wants the clouds spread apart and tiny (β → 0); sampling wants them packed onto the prior (β up). The ELBO's balance is the sweet spot in the middle — good reconstruction and a latent space you can actually draw from.

import torch, torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.enc       = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU())
        self.fc_mu     = nn.Linear(256, d_latent)   # encoder mean
        self.fc_logvar = nn.Linear(256, d_latent)   # encoder log-variance
        self.dec       = nn.Sequential(nn.Linear(d_latent, 256), nn.ReLU(),
                                       nn.Linear(256, d_in))  # NEURAL decoder

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)      # z = mu + sigma * eps

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.dec(z), mu, logvar

def elbo_loss(x_hat, x, mu, logvar):
    rec = F.mse_loss(x_hat, x, reduction="sum")           # reconstruction term
    kl  = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()  # KL to N(0, I)
    return rec + kl    # minimise (rec + kl)  ==  maximise the ELBO

ELBO as a lower bound on log p(x)

$$ \log p(x) = \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\!\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\!\big(q_\phi(z\mid x)\,\|\,p(z)\big)}_{\text{ELBO}} \;+\; \underbrace{D_{\mathrm{KL}}\!\big(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\big)}_{\ge\,0\ \text{— the gap}} $$

  • The same log p(x) PPCA evaluates in closed form — here it is intractable
  • It splits exactly into the ELBO plus the KL from qφ to the true posterior
  • That KL is ≥ 0, so the ELBO is a genuine lower bound; the gap shuts as qφ → pθ(z|x)

$$ \log p(\text{data}) = \underbrace{\text{ELBO}}_{\text{what we maximise}} \;+\; \underbrace{\text{distance from encoder to the TRUE posterior}}_{\ge\,0\ \text{— invisible gap}} $$

In words. This identity is what makes VAE training principled. The quantity you actually care about — log p(data), how well the model explains the data, the very thing PPCA computes by hand — splits exactly into two pieces: the ELBO you can optimise, plus the KL distance from the encoder's guess to the true posterior (the perfect distribution over codes given the data). Because a KL distance is never negative, the ELBO can only ever sit below log p(data), so maximising it is safe. And it does double duty: pushing the ELBO up both fits the data better and squeezes the encoder closer to the true posterior. The true posterior is exactly the object PPCA had in closed form and the curved decoder destroyed — so here we never compute the gap; we just optimise the bound.

  • log p(data)the marginal likelihood — intractable here, exact for PPCA
  • ELBOthe bound we maximise: reconstruction minus KL-to-prior
  • true posteriorpθ(z|x) — what the encoder approximates; closed-form only when the decoder is linear
  • invisible gapthe KL between encoder and true posterior — never computed, only bounded

Linear networks ⇒ PPCA, exactly. Strip the non-linearities: let the encoder and decoder be affine maps and the decoder's output noise be isotropic Gaussian. Then qφ(z|x) can represent the true Gaussian posterior, the invisible gap above closes to zero, the bound becomes tight, and the ELBO is maximised by precisely the PPCA solution — W spanning the top principal subspace. The VAE is, quite literally, PPCA with the linear maps replaced by neural networks and the now-unreachable posterior amortised into an encoder. That is the spine of this whole arc.
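
For reference, the closed-form posterior in question, written in the standard PPCA notation (decoder x = Wz + μ + noise with variance σ²; here μ is the data mean and σ² the output noise, not the encoder's outputs):

$$ p(z \mid x) = \mathcal{N}\!\big(M^{-1}W^{\top}(x-\mu),\ \sigma^{2}M^{-1}\big), \qquad M = W^{\top}W + \sigma^{2}I. $$

The mean is an affine function of x and the covariance does not depend on x at all, which is why an affine encoder is enough. One caveat worth stating: a diagonal-covariance encoder can match σ²M⁻¹ exactly only when WᵀW is diagonal, that is, when the columns of W are mutually orthogonal.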

Blurriness is the Gaussian likelihood. An MSE reconstruction loss assumes the decoder output is the mean of a Gaussian. When several sharp outputs are all plausible for one code, the likelihood-maximising choice is their average — and an average of crisp images is a blurry one. This is intrinsic to the maximum-likelihood objective, not a bug; it's exactly the trade the next page, GANs, refuses to make — abandoning likelihood for a critic that punishes blur directly.
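
A toy check of the averaging claim, assuming two hypothetical 4-pixel "images" that are equally plausible for the same code. Minimising the expected squared error by gradient descent lands on their pixel-wise mean, a grey strip that matches neither sharp target.

```python
import torch

# Two equally plausible sharp outputs for one code (toy 1-D strips).
a = torch.tensor([1.0, 1.0, 0.0, 0.0])   # bright on the left
b = torch.tensor([0.0, 0.0, 1.0, 1.0])   # bright on the right

def expected_mse(pred):
    """Squared error averaged over the two equally likely targets."""
    return 0.5 * (pred - a).pow(2).sum() + 0.5 * (pred - b).pow(2).sum()

pred = torch.zeros(4, requires_grad=True)
opt = torch.optim.SGD([pred], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    expected_mse(pred).backward()
    opt.step()

print("MSE-optimal output:", pred.detach())
print("loss at the blurry optimum:", expected_mse(pred).item())
print("loss when committing to a: ", expected_mse(a).item())
```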

Posterior collapse. Give the decoder enough power (say an autoregressive PixelCNN) and it can model the data without the code. The encoder then takes the free lunch: it sets qφ(z|x) = p(z), driving the KL term to zero and leaving the latent carrying no information. Standard fixes: KL annealing (ramp the KL weight up from zero), free bits (don't penalise KL below a per-dimension floor), or deliberately weakening the decoder.
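
Collapse is easy to spot from the encoder's outputs alone. The sketch below is one possible diagnostic, not a standard API: it averages the KL per latent dimension over a batch and counts dimensions above a threshold. The function names and the 0.01 threshold are assumptions chosen for illustration.

```python
import torch

def kl_per_dimension(mu, logvar):
    """Batch-averaged KL(q(z|x) || N(0, I)) for each latent dimension."""
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())   # (batch, d_latent)
    return kl.mean(dim=0)                                  # (d_latent,)

def active_dims(mu, logvar, threshold=0.01):
    """Count dimensions whose average KL exceeds a (hypothetical) threshold."""
    return int((kl_per_dimension(mu, logvar) > threshold).sum())

# Toy encoder outputs: dim 0 carries information, dim 1 has collapsed to the prior.
torch.manual_seed(0)
mu = torch.stack([torch.randn(64) * 2.0, torch.zeros(64)], dim=1)
logvar = torch.stack([torch.full((64,), -2.0), torch.zeros(64)], dim=1)

kl = kl_per_dimension(mu, logvar)
n_active = active_dims(mu, logvar)
print("KL per dimension:", kl)
print("active dimensions:", n_active)
```

A dimension whose KL sits at zero for every input is emitting the prior regardless of x, so the decoder cannot be reading anything from it.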

VQ-VAE: discrete latents. Replace the Gaussian code with the nearest vector in a learned codebook: the encoder emits a continuous vector, you snap it to the closest entry, and the decoder reconstructs from that. The discrete bottleneck sidesteps posterior collapse and produces tokens, which is why VQ-VAE-style tokenisers feed image and audio into autoregressive transformers (DALL·E v1's discrete VAE, Parti's ViT-VQGAN) and many multi-modal LLMs.
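
The snapping step can be sketched in a few lines. Two things here go beyond the text and are labelled as such: the straight-through gradient copy follows the usual VQ-VAE recipe, and the codebook and commitment losses that a full implementation needs are left out. The name `vector_quantize` and the tensor sizes are illustrative.

```python
import torch

def vector_quantize(z_e, codebook):
    """Snap each encoder vector to its nearest codebook entry.

    z_e:      (batch, d) continuous encoder outputs
    codebook: (K, d) code vectors
    Returns the quantised vectors (with straight-through gradients) and indices.
    """
    dists = torch.cdist(z_e, codebook)        # (batch, K) Euclidean distances
    idx = dists.argmin(dim=1)                 # nearest entry per input
    z_q = codebook[idx]                       # (batch, d)
    # Straight-through estimator (an assumption beyond the text): the forward
    # pass uses z_q, the backward pass treats quantisation as the identity.
    z_st = z_e + (z_q - z_e).detach()
    return z_st, idx

torch.manual_seed(0)
codebook = torch.randn(16, 4)                 # K = 16 entries of dimension 4
z_e = torch.randn(8, 4, requires_grad=True)   # stand-in encoder outputs

z_st, idx = vector_quantize(z_e, codebook)
z_st.sum().backward()                         # gradients pass straight through to z_e
```

The integer `idx` values are the tokens a downstream autoregressive model would consume.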

Latent diffusion — the VAE as compressor. The dominant production role for VAEs today is not generating directly. Stable Diffusion (Rombach et al., 2022) trains a VAE to compress images to a small latent grid, then runs a diffusion model in that space rather than over pixels. The VAE handles dull perceptual compression; diffusion handles the hard semantic generation — and the VAE's blurry single-pass decoder becomes a feature, not a flaw. This links the arc forward: the next-but-one page is diffusion, and a VAE is what sits underneath it.

import torch, torch.nn as nn
import torch.nn.functional as F

# Posterior-collapse defences: KL annealing + free bits
def vae_loss_robust(x_hat, x, mu, logvar, beta=1.0, free_bits=0.05):
    rec = F.mse_loss(x_hat, x, reduction="sum")
    # KL per latent dim, floored so we don't punish small but useful codes
    kl_per_dim = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())
    kl = (kl_per_dim - free_bits).clamp(min=0).sum()
    return rec + beta * kl

def kl_anneal(step, warmup=5000):
    return min(1.0, step / warmup)   # ramp beta 0 -> 1 over warmup steps

# The PPCA limit: a *linear* VAE. With these affine maps and a Gaussian
# output, maximising the ELBO recovers Probabilistic PCA exactly.
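class LinearVAE(nn.Module):
    def __init__(self, d_in, d_latent):
        super().__init__()
        self.fc_mu     = nn.Linear(d_in, d_latent)      # linear encoder mean
        self.fc_logvar = nn.Linear(d_in, d_latent)      # linear encoder log-variance
        self.dec       = nn.Linear(d_latent, d_in)      # linear decoder == W z + mu

    def forward(self, x):
        mu, logvar = self.fc_mu(x), self.fc_logvar(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterise
        return self.dec(z), mu, logvar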