Mode

Key idea

Generate data by drawing a few random numbers, then mapping them up with a straight line plus noise. Pick a latent point z — a short list of numbers we'll call the datapoint's code — from a standard Gaussian blob, stretch and rotate it through a matrix W, shift it, and sprinkle on a little isotropic noise — out comes a sample that looks like your data. That single recipe, x = Wz + μ + ε, is Probabilistic PCA. Everything else in this section keeps the recipe and replaces the straight line with something more expressive.

Sample a 1-D code, decode it onto the line, add round noise — PPCA generating data you can tune
σ = 0.00

The dashed orange ellipse is the Gaussian PPCA fits to the indigo data — that's the distribution we sample from. Hit Generate to watch a code get drawn from the fixed bell curve, decoded onto the principal line, and noised into an orange sample inside the ellipse. Drag noise σ down to the plain PCA limit (the ellipse collapses onto the line) or past the MLE mark (a fat round blob); at the MLE mark the ellipse lines up with the data. Switch to Two clusters to see why one straight axis can't capture multi-modal data — the motivation for everything that follows.

You already met PCA as a dimensionality-reduction tool: find the directions of greatest variance, project onto them. That's the encoder half of the story. Here we turn it around and ask the generative question: what process could have produced this data?

The generative view. Imagine each data point started life as a handful of latent numbers z — its code. That word does a lot of work on this page, so pin it down: a code is a short list of numbers — the datapoint's coordinates in a small, hidden space, far fewer numbers than the datapoint itself has. (For face images, two code numbers might stand for "how much it smiles" and "head tilt".) A linear map W decodes that code — expands it back into the full high-dimensional datapoint — then we add the mean μ, and reality adds a bit of measurement noise. Run that forward and you've sampled a new datapoint. Learn W and the noise level from data and you've fit a generative model.

Why the top axis is the informative one. Shlens' classic tutorial makes this click with a spring: a ball bounces along one hidden axis, but you film it with a few cameras at careless angles, so your raw recording is a redundant, rotated, noisy pile of numbers. All the real motion is one-dimensional — and PCA rediscovers it as the direction of greatest variance, discarding the redundancy between cameras. That direction is the axis of motion, and in the generative story above the ball's displacement along it is exactly the code. There's an interactive spring visualisation on the dimensionality-reduction PCA page that lets you watch PCA recover that axis of motion from a tilted camera's redundant measurements.

Why start here. PCA is the "hydrogen atom" of generative modelling — the one case where everything is linear and Gaussian, so every quantity has a closed form. There's no adversary, no sampling loop, no intractable integral. Once you see generation as latent code → decoder → data, the VAE is just "make the decoder a neural net", the GAN is "drop the likelihood and train a critic", and diffusion is "make the decoder a long denoising chain".

The catch. A straight line can only produce ellipsoidal Gaussian blobs. Real data lives on curved, multi-modal manifolds — faces, audio, language. The rest of this section is the story of replacing that line with curves while keeping the same "sample a latent, decode it" skeleton.

The generative framing helps when

  • You want a probabilistic model with a real likelihood (model selection, anomaly scores)
  • Data is roughly Gaussian / ellipsoidal — PPCA is the right model, not just a baseline
  • You need to handle missing values (the latent model imputes them naturally)
  • You want to understand VAEs — PPCA is the exact special case to anchor on

It breaks down when

  • Data is multi-modal or lives on a curved manifold (a line can't bend)
  • You need sharp, realistic samples — Gaussian noise blurs everything
  • The interesting structure is non-linear (use a VAE or diffusion)
  • You only want the projection, not a generative story — then plain PCA is simpler

import numpy as np

# Probabilistic PCA as a *generator*: sample latent -> decode -> add noise
# (W, mu, sigma learned from data; here we just sample from a fitted model)
def sample_ppca(W, mu, sigma, n):
    k = W.shape[1]                      # latent dimension
    z = np.random.randn(n, k)           # 1. draw codes from N(0, I)
    x = z @ W.T + mu                    # 2. decode with the linear map
    x += sigma * np.random.randn(*x.shape)   # 3. add isotropic noise
    return x

# Fitting is just PCA: top-k eigenvectors of the covariance give W,
# and the *leftover* variance becomes the noise level sigma^2.
Want the actual likelihood, the EM fit, and the exact VAE connection?

The PPCA model

$$ z \sim \mathcal{N}(0, I), \qquad x \mid z \sim \mathcal{N}(Wz + \mu,\; \sigma^2 I) $$

  • Latent prior is a unit Gaussian — a featureless blob of codes
  • Decoder Wz + μ is linear; noise is isotropic with a single scale σ²
  • Marginal likelihood is closed-form: x ~ N(μ, WWᵀ + σ²I)

$$ \text{code} \sim \text{unit Gaussian}, \qquad \text{data} = \text{decoder}(\text{code}) + \text{noise} $$

In words. Two steps. First, draw a low-dimensional code z from a plain unit Gaussian — no structure, just a round cloud of points centred at zero. Second, push that code through a decoder that here is just a matrix multiply plus a shift (Wz + μ), then add a dab of round noise of size σ. Because both steps are Gaussian and the map is linear, the data you generate is itself exactly Gaussian, with covariance WWᵀ + σ²I — the part WWᵀ is the structured spread the model captured, and σ²I is the leftover wiggle it couldn't be bothered to explain. That clean formula is the whole reason PPCA is solvable by hand.

  • codethe latent vector z, drawn from N(0, I) — usually far fewer dims than the data
  • decoderthe linear map Wz + μ; W's columns are the directions the code stretches along
  • noiseisotropic Gaussian of scale σ — the same in every direction
  • WWᵀ + σ²Ithe resulting data covariance — structure plus uniform leftover variance

PCA falls out of the limit. As the noise σ² → 0, the maximum-likelihood W spans exactly the top-k principal subspace — the same eigenvectors classical PCA gives you. So PCA is PPCA with the noise turned off. Keeping σ² finite is what buys you a proper probability density.

You get a real likelihood. Because x ~ N(μ, WWᵀ + σ²I), you can evaluate log p(x) for any point. That powers things classical PCA can't do: principled model selection (how many components?), anomaly detection (low likelihood = outlier), and comparing fits across datasets.

Posterior over codes. Given a datapoint, the posterior p(z | x) is again Gaussian — a closed-form encoder. This is exactly the object a VAE has to approximate with a neural network because its decoder is non-linear and the integral stops being tractable.

Fit by EM (or eigen-decomposition). Tipping & Bishop (1999) gave both: a direct solution via the data covariance's eigenvectors, and an EM algorithm that scales better and handles missing data — E-step infers codes, M-step updates W and σ².

Factor analysis is the cousin. Allow the noise to differ per feature (a diagonal Ψ instead of σ²I) and you get factor analysis. PPCA is the special case where every feature shares one noise level.

The Gaussian PPCA fits to thousands of real handwritten digits — walk its axes, or sample from it

This is PPCA fit to ~7,000 real handwritten 3s (change the digit above). The bottom strip walks a single eigen-axis from −3σ to +3σ — exactly the traversal the accompanying Python script draws. Sample from Gaussian draws all twelve coefficients from N(0, 1): the results are recognisably digit-ish but ghostly, because a Gaussian is far too simple to capture the true density of images. That failure is the motivation for the whole section.

import numpy as np
from sklearn.decomposition import PCA

# scikit-learn's PCA exposes the PPCA likelihood directly.
pca = PCA(n_components=10).fit(X)        # X: (n_samples, n_features)

# score() returns the average log-likelihood under the Gaussian PPCA model
print("avg log-likelihood:", pca.score(X))

# The estimated isotropic noise variance sigma^2 (mean of discarded eigenvalues)
print("noise variance:", pca.noise_variance_)

# Likelihood-based model selection: which k generalises best?
for k in [2, 5, 10, 20]:
    ll = PCA(n_components=k).fit(X_train).score(X_val)
    print(f"k={k:2d}  val log-likelihood={ll:.2f}")
Want the marginal-likelihood derivation, the rotation ambiguity, and where the line breaks?

Marginal likelihood

$$ p(x) = \int p(x \mid z)\, p(z)\, dz = \mathcal{N}\!\big(x \;\big|\; \mu,\; C\big), \qquad C = WW^\top + \sigma^2 I $$

  • Integrating out z is tractable because everything is Gaussian — a rare luxury
  • The MLE for W is Ukk − σ²I)½R — top eigenvectors, scaled, times any rotation R
  • The free rotation R is why PPCA axes aren't individually identified

$$ p(\text{data}) = \text{average over all codes of } p(\text{data} \mid \text{code}) = \text{one big Gaussian with covariance } WW^\top + \sigma^2 I $$

In words. To score a datapoint without knowing its code, you average the decoder's output probability over every possible code, weighted by the prior — that's the integral. For the VAE this integral is hopeless and must be bounded by the ELBO; for PPCA it collapses to a single Gaussian because a Gaussian pushed through a linear map and blurred by Gaussian noise stays Gaussian. The covariance WWᵀ + σ²I says it all: structure captured by the decoder plus uniform leftover noise. The price of linearity is the catch in the next paragraph — the solution is only pinned down up to an arbitrary rotation of the latent axes.

  • average over all codesmarginalising the latent z against its prior — the integral
  • one big Gaussianthe result stays Gaussian only because the decoder is linear
  • rotation Ryou can spin the latent axes freely without changing p(x) — they aren't unique
  • σ²recovered as the average of the eigenvalues you didn't keep

The exact bridge to VAEs. A VAE is PPCA with two upgrades: (1) the decoder Wz + μ becomes a neural network fθ(z), so the data manifold can curve; (2) the now-intractable posterior p(z | x) is approximated by an encoder network qφ(z | x). Train by maximising the ELBO — a lower bound on the same log p(x) PPCA computes exactly. Set the networks to be linear and the ELBO becomes tight: you recover PPCA. See the VAE page for the full derivation. This is the reason this section opens with PCA.

Rotation ambiguity. Because C = WWᵀ + σ²I is unchanged if you replace W with WR for any orthogonal R, the latent axes aren't individually identifiable — only the subspace is. This is harmless for density modelling but means PPCA components aren't "interpretable" the way people sometimes hope. (Factor-analysis rotations like varimax exploit exactly this freedom.)

Why a line is not enough. PPCA can only ever produce one ellipsoidal blob. Two well-separated clusters? A spiral? A ring (like the viz on the overview page)? A single Gaussian smears straight across the gap. The fix is non-linearity — and that's the entire motivation for the neural decoders in the rest of the section. A mixture of PPCAs patches multi-modality cheaply; a VAE handles curvature directly.

Missing data, for free. Because it's a proper latent-variable model, PPCA handles partially-observed inputs by marginalising the missing entries in the E-step — a genuinely useful capability classical PCA lacks.

Cost. Closed-form via SVD is O(nd·min(n,d)); EM is O(ndk) per iteration and shines when k ≪ d or data is missing. Both are trivial next to training any neural generative model — the reason PPCA is still the first thing to try when the data might actually be Gaussian.

import numpy as np

# EM for PPCA — the algorithm a VAE generalises when the decoder goes non-linear.
def ppca_em(X, k, iters=100):
    n, d = X.shape
    mu = X.mean(0)
    Xc = X - mu
    W = np.random.randn(d, k)
    sigma2 = 1.0
    for _ in range(iters):
        # E-step: posterior over codes  q(z|x) = N(M^-1 W^T xc, sigma2 M^-1)
        M = W.T @ W + sigma2 * np.eye(k)
        Minv = np.linalg.inv(M)
        Ez = Xc @ W @ Minv                         # expected codes
        Ezz = n * sigma2 * Minv + Ez.T @ Ez        # expected outer products
        # M-step: re-fit the linear decoder and the noise level
        W = (Xc.T @ Ez) @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc**2)
                  - 2*np.sum(Ez @ W.T * Xc)
                  + np.trace(Ezz @ (W.T @ W))) / (n * d)
    return W, mu, sigma2
Too dense?