You already met PCA as a dimensionality-reduction tool: find the directions of greatest variance, project onto them. That's the encoder half of the story. Here we turn it around and ask the generative question: what process could have produced this data?
The generative view. Imagine each datapoint started life as a handful of latent numbers z, its code. That word does a lot of work on this page, so pin it down: a code is a short list of numbers, the datapoint's coordinates in a small, hidden space, far fewer numbers than the datapoint itself has. (For face images, two code numbers might stand for "how much it smiles" and "head tilt".) A linear map W decodes that code, expanding it back into the full high-dimensional datapoint; then we add the mean μ, and reality adds a bit of measurement noise: x = Wz + μ + noise. Run that forward and you've sampled a new datapoint. Learn W, μ and the noise level from data and you've fit a generative model.
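The forward process just described fits in a few lines of NumPy. This is a minimal sketch, not a reference implementation: it assumes the codes are standard-normal and the noise is isotropic Gaussian (the text only promises "a bit of measurement noise"), and the function name `sample_linear_gaussian`, the sizes and the noise level are all made up for illustration.

```python
import numpy as np

def sample_linear_gaussian(W, mu, sigma, n, rng):
    """Draw n datapoints from x = W z + mu + noise."""
    d, q = W.shape
    z = rng.standard_normal((n, q))              # latent codes, assumed z ~ N(0, I)
    noise = sigma * rng.standard_normal((n, d))  # assumed isotropic Gaussian noise
    x = z @ W.T + mu + noise                     # decode, shift by the mean, add noise
    return x, z

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 2))   # hypothetical decoder: 2 code numbers -> 5 observed numbers
mu = np.arange(5.0)               # hypothetical mean
x, z = sample_linear_gaussian(W, mu, sigma=0.1, n=1000, rng=rng)
print(x.shape, z.shape)           # (1000, 5) (1000, 2)
```

Each row of `x` is a freshly generated datapoint, and the matching row of `z` is the code it was decoded from.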
Why the top axis is the informative one. Shlens' classic tutorial makes this click with a spring: a ball bounces along one hidden axis, but you film it with a few cameras at careless angles, so your raw recording is a redundant, rotated, noisy pile of numbers. All the real motion is one-dimensional — and PCA rediscovers it as the direction of greatest variance, discarding the redundancy between cameras. That direction is the axis of motion, and in the generative story above the ball's displacement along it is exactly the code. There's an interactive spring visualisation on the dimensionality-reduction PCA page that lets you watch PCA recover that axis of motion from a tilted camera's redundant measurements.
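If you'd rather check the spring story numerically than take it on faith, here is a rough simulation. Everything about the setup is an assumption made for illustration: three cameras giving six numbers per frame, a cosine for the bounce, a random unit vector `axis` standing in for how the motion lands in the cameras, and a noise level of 0.05.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 500)
displacement = np.cos(2.0 * np.pi * 0.5 * t)   # the hidden one-dimensional motion

# Hypothetical rig: 3 cameras x 2 image coordinates = 6 numbers per frame.
axis = rng.standard_normal(6)
axis /= np.linalg.norm(axis)                   # unit vector: where the motion shows up
recording = np.outer(displacement, axis) + 0.05 * rng.standard_normal((500, 6))

# Plain PCA: eigen-decompose the covariance of the centred recording.
centred = recording - recording.mean(axis=0)
cov = centred.T @ centred / (len(centred) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)         # eigh returns eigenvalues in ascending order
top = eigvecs[:, -1]                           # direction of greatest variance

explained = eigvals[-1] / eigvals.sum()        # share of variance on the top axis
alignment = abs(top @ axis)                    # |cosine| between PCA axis and true axis
code = centred @ top                           # one number per frame: the recovered code
corr = abs(np.corrcoef(code, displacement)[0, 1])

print(f"explained={explained:.3f} alignment={alignment:.3f} corr={corr:.3f}")
```

All three printed numbers should come out close to 1: nearly all the variance sits on one axis, that axis matches the true direction of motion, and the projection onto it tracks the ball's displacement (up to sign, which PCA cannot determine).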
Why start here. PCA is the "hydrogen atom" of generative modelling — the one case where everything is linear and Gaussian, so every quantity has a closed form. There's no adversary, no sampling loop, no intractable integral. Once you see generation as latent code → decoder → data, the VAE is just "make the decoder a neural net", the GAN is "drop the likelihood and train a critic", and diffusion is "make the decoder a long denoising chain".
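To make "closed form" concrete: when the noise is assumed isotropic Gaussian, the model is usually called probabilistic PCA, and Tipping and Bishop (1999) showed that its maximum-likelihood fit drops straight out of an eigendecomposition. The sketch below follows that result; the function name `fit_ppca` and the synthetic data are illustrative, and it picks one particular W out of the family that differ only by a rotation of the code space.

```python
import numpy as np

def fit_ppca(X, q):
    """Closed-form maximum-likelihood fit of probabilistic PCA (requires q < X.shape[1])."""
    mu = X.mean(axis=0)
    centred = X - mu
    cov = centred.T @ centred / len(X)               # ML covariance estimate (divides by n)
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    sigma2 = eigvals[q:].mean()                      # noise variance: mean of discarded eigenvalues
    scale = np.sqrt(np.maximum(eigvals[:q] - sigma2, 0.0))
    W = eigvecs[:, :q] * scale                       # top-q eigenvectors, columns rescaled
    return W, mu, sigma2

# Synthetic check with a hypothetical ground truth: 2 code numbers, 5 observed numbers.
rng = np.random.default_rng(0)
W_true = rng.standard_normal((5, 2))
X = rng.standard_normal((20000, 2)) @ W_true.T + 0.3 * rng.standard_normal((20000, 5))

W, mu, sigma2 = fit_ppca(X, q=2)
model_cov = W @ W.T + sigma2 * np.eye(5)
true_cov = W_true @ W_true.T + 0.3**2 * np.eye(5)
print(sigma2)                                        # should land near 0.3**2 = 0.09
print(np.abs(model_cov - true_cov).max())            # should be small
```

No gradient descent, no iterations: one eigendecomposition and the model is fit. Note that the comparison is between covariances rather than between `W` and `W_true` directly, because the data cannot distinguish decoders that differ by a rotation of the codes.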
The catch. A linear decoder fed Gaussian codes can only produce a single ellipsoidal Gaussian blob. Real data (faces, audio, language) lives on curved, multi-modal manifolds. The rest of this section is the story of replacing that linear map with curved, nonlinear ones while keeping the same "sample a latent, decode it" skeleton.
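One line of algebra shows why the blob is unavoidable. It assumes the standard probabilistic-PCA choices, which the text above implies but does not spell out: standard-normal codes and isotropic Gaussian noise with variance σ².

```latex
x = W z + \mu + \varepsilon, \qquad
z \sim \mathcal{N}(0, I), \quad
\varepsilon \sim \mathcal{N}(0, \sigma^2 I)
\;\;\Longrightarrow\;\;
x \sim \mathcal{N}\!\left(\mu,\; W W^{\top} + \sigma^2 I\right)
```

Whatever W you pick, x is a single Gaussian whose level sets are ellipsoids. Changing W stretches and rotates the blob, but it can never bend it or split it into several modes.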