
Key idea

Squeeze data through a narrow channel, then rebuild it. An encoder compresses the input to a small code; a decoder tries to reconstruct the original from that code alone. The bottleneck has nowhere to hide redundancy — only the features that actually help reconstruction survive.

Task: Compress the 12×12 image on the left through a narrow bottleneck of just two numbers (the orange dot's position in the middle), then reconstruct the image on the right. Click a preset to encode it; drag the orange dot to roam the latent space; toggle VAE mode for stochastic codes. The reconstruction error tells you what the code couldn't keep.
Encoder · 2-D latent · decoder — drag the dot to decode any code

Why the bottleneck matters

If the code were as wide as the input, the network could just copy values through and learn nothing: the trivial identity function. The bottleneck makes that impossible. Only a fixed number of dimensions get to pass, so the encoder has to choose what's worth keeping. The decoder, working from those few numbers alone, becomes a reconstruction prior; it fills in what it has learned tends to be there. Trained jointly, the two halves converge on a compact summary of the data's structure.

What the latent space encodes

The 2-D latent in the figure does a job much like PCA, but non-linear. Each preset lands at its own spot in the plane; points between presets decode to blends. Whatever varies most across the training set tends to end up as a direction in latent space. Walk along that direction and the reconstruction smoothly morphs, without you ever telling the network what "morph" means.
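Decoding the points between two codes can be sketched as follows. This is a minimal illustration, assuming an untrained stand-in decoder `dec` with a 2-D latent and a 12×12 output to mirror the figure; `interpolate` is a hypothetical helper name, not the figure's actual code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in decoder (assumption): 2-D code -> flattened 12x12 image.
dec = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 144))

def interpolate(dec, z_a, z_b, steps=5):
    """Decode evenly spaced points on the straight line from z_a to z_b."""
    t = torch.linspace(0.0, 1.0, steps).unsqueeze(1)   # (steps, 1)
    z = (1 - t) * z_a + t * z_b                        # (steps, 2)
    with torch.no_grad():
        return dec(z).view(steps, 12, 12)

z_a = torch.tensor([-1.0, 0.5])    # code of one preset (made-up values)
z_b = torch.tensor([1.5, -0.5])    # code of another preset (made-up values)
frames = interpolate(dec, z_a, z_b)
print(frames.shape)
```

With a trained decoder, the intermediate frames are the "blends" you see when dragging the dot between two presets.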

Why this is unsupervised

There are no labels. The target is the input — every example supervises its own reconstruction. That's the appeal: any pile of unlabelled data is enough to learn an embedding you can then plug into a classifier, a clustering algorithm, or an anomaly detector.

Denoising autoencoders

A small twist with outsized benefits: corrupt the input (add noise, mask pixels) and train the network to recover the clean original. The model can no longer memorise inputs because the corrupted inputs change every step; it has to learn what should be there. This is a close conceptual ancestor of masked-token pretraining in language models.

Turning it generative: the VAE

A plain AE's latent space has holes — sample a random code and the decoder usually returns garbage, because nothing trained it on that point. The variational autoencoder fixes that: the encoder outputs a distribution over codes and every one is pulled toward a shared prior, so the whole space becomes samplable. That turns an autoencoder into a proper generative model — and it gets its own treatment (the ELBO, the reparameterisation trick, why it's a lower bound, sampling) on the Variational Autoencoders page. The VAE mode toggle in the figure above is a taste of it: the code becomes a little cloud rather than a point.

Reach for it when

  • Learning representations from a pile of unlabelled data
  • Non-linear dimensionality reduction (a deeper cousin of PCA)
  • Anomaly detection — large reconstruction error flags an outlier
  • Denoising or inpainting where the corruption pattern is known
  • Pretraining an embedding from unlabelled data to feed a classifier or clusterer
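The anomaly-detection use in the list above can be sketched as scoring each example by its reconstruction error and flagging scores above a cut-off. This is a rough sketch under stated assumptions: the model here is an untrained stand-in that returns only the reconstruction, the 0.99 quantile is an arbitrary illustrative choice, and `reconstruction_scores` and `fit_threshold` are hypothetical helper names.

```python
import torch
import torch.nn as nn

def reconstruction_scores(model, x):
    """Per-example mean squared reconstruction error (higher = more anomalous)."""
    model.eval()
    with torch.no_grad():
        x_hat = model(x)
    return ((x_hat - x) ** 2).mean(dim=1)

def fit_threshold(scores, quantile=0.99):
    """Pick a cut-off from scores computed on data assumed to be normal."""
    return torch.quantile(scores, quantile)

torch.manual_seed(0)
# Stand-in model (assumption): in practice, an AE already trained on normal data.
model = nn.Sequential(nn.Linear(20, 4), nn.ReLU(), nn.Linear(4, 20))
x_val = torch.randn(200, 20)                 # held-out "normal" examples
threshold = fit_threshold(reconstruction_scores(model, x_val))

x_new = torch.randn(10, 20)                  # incoming examples to screen
is_outlier = reconstruction_scores(model, x_new) > threshold
```

The quantile sets the false-alarm rate on normal data, so it is usually tuned against whatever labelled incidents are available.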

Skip it when

  • You have labels — supervised pretraining usually gives sharper embeddings
  • You want a model you can sample from — reach for the VAE, a GAN, or diffusion
  • Linear methods (PCA) are already good enough
  • Best embeddings for transfer are the goal — a contrastive method (SimCLR, CLIP) usually wins

import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(), nn.Linear(256, d_latent))
        self.dec = nn.Sequential(nn.Linear(d_latent, 256), nn.ReLU(), nn.Linear(256, d_in))

    def forward(self, x):
        z = self.enc(x)           # compress to the d_latent-dimensional code
        return self.dec(z), z     # reconstruction and code

model = AE()
x = torch.rand(64, 784)           # stand-in batch of flattened 28×28 images

# Loss: just reconstruction error
x_hat, z = model(x)
loss = ((x_hat - x) ** 2).mean()

The reconstruction objective

$$ \min_{\theta,\phi} \;\; \mathbb{E}_{\mathbf{x}}\big\lVert \mathbf{x} - g_\theta\!\big(f_\phi(\mathbf{x})\big) \big\rVert^2, \qquad f_\phi:\mathbb{R}^d \to \mathbb{R}^k,\;\; k \ll d $$

  • f_φ (encoder): maps the input down to a k-dimensional code
  • g_θ (decoder): maps the code back up to the input space
  • k ≪ d (the bottleneck): far fewer code dims than input dims
  • No labels: the target is the input (self-supervised)

$$ \text{minimise the average } \big\lVert\, \text{input} - \text{decode}(\text{encode}(\text{input}))\, \big\rVert^2 $$

In words. An autoencoder is trained on a single job: reproduce its own input after squeezing it through a narrow, k-dimensional bottleneck. There's no label and no other term — just the gap between the input and its reconstruction. Because the code has far fewer numbers than the input (k ≪ d), the encoder can't keep everything; it's forced to spend its few dimensions on whatever matters most for rebuilding the data, and the decoder learns to fill in the rest. Everything else on this page — denoising, sparsity, the variational twist — is this objective plus one extra term or corruption.
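The objective above translates into a short training loop. This is a minimal sketch, assuming synthetic data lying near a k-dimensional subspace, illustrative layer sizes, and Adam with a learning rate of 1e-3; none of these choices come from the page.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d, k = 64, 8   # input and code sizes (illustrative)
enc = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, k))   # f_phi
dec = nn.Sequential(nn.Linear(k, 32), nn.ReLU(), nn.Linear(32, d))   # g_theta
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

# Synthetic data (assumption): points near a k-dimensional subspace of R^d.
x = torch.randn(512, k) @ torch.randn(k, d) + 0.01 * torch.randn(512, d)

losses = []
for step in range(200):
    x_hat = dec(enc(x))
    loss = ((x - x_hat) ** 2).sum(dim=1).mean()   # batch estimate of E||x - g(f(x))||^2
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

Both halves are updated by the same optimiser step, which is what "trained jointly" amounts to in practice.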

Reconstruction loss: pick by data type. Use MSE (squared error) for continuous outputs like image pixels in [0,1] or audio amplitudes; use binary cross-entropy per pixel when the decoder produces Bernoulli probabilities (the original VAE paper's choice for MNIST). For discrete data (text tokens, VQ codes), use ordinary cross-entropy. The loss implicitly encodes a noise model on the output: MSE assumes Gaussian noise around the reconstruction, BCE assumes Bernoulli. A mismatched loss tends to show up as blurry or oddly over-sharp reconstructions.
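The three loss choices map onto standard PyTorch functions. The sketch below uses random tensors as stand-ins for decoder outputs and targets; the shapes and the vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, D, V = 4, 16, 10   # batch, features, vocabulary size (illustrative)

# Continuous targets: MSE (Gaussian noise model).
x_cont = torch.rand(B, D)
out_cont = torch.randn(B, D)                 # stand-in decoder output
mse = F.mse_loss(out_cont, x_cont)

# Targets in [0, 1] with a Bernoulli decoder: BCE computed from raw logits.
x_bin = torch.rand(B, D).round()
logits_bin = torch.randn(B, D)               # stand-in decoder logits
bce = F.binary_cross_entropy_with_logits(logits_bin, x_bin)

# Discrete tokens: cross-entropy over the vocabulary at each position.
tokens = torch.randint(0, V, (B, D))
logits_tok = torch.randn(B, D, V)            # stand-in decoder logits
ce = F.cross_entropy(logits_tok.reshape(-1, V), tokens.reshape(-1))
```

Feeding logits to `binary_cross_entropy_with_logits` rather than applying a sigmoid first is the numerically safer route, so the decoder's last layer can stay linear.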

Undercomplete vs. overcomplete. Undercomplete means the latent is smaller than the input — the bottleneck does the regularising for you. Overcomplete (latent bigger than input) needs help to avoid copying the input straight through: sparsity penalties, denoising corruption, or contractive penalties on the encoder Jacobian. Without one of those, an overcomplete AE happily learns the identity and tells you nothing.

Tied weights. A classical trick: force the decoder weights to be the transpose of the encoder weights (W_dec = W_enc^T). Halves the parameter count and gives a clean analogy to PCA — for a linear AE with MSE loss and tied weights, the optimal solution literally spans the top principal components. Modern deep AEs usually don't bother, but it's a clean inductive bias for small models.
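Weight tying can be sketched by storing a single matrix and using its transpose in the decoder. This is one possible arrangement, assuming a linear model with separate biases; `TiedLinearAE` is a hypothetical class name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedLinearAE(nn.Module):
    """Linear AE whose decoder weight is the transpose of the encoder weight."""
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d_latent, d_in) * 0.01)  # W_enc, shape (k, d)
        self.b_enc = nn.Parameter(torch.zeros(d_latent))
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def forward(self, x):
        z = F.linear(x, self.W, self.b_enc)          # z = x W^T + b_enc
        x_hat = F.linear(z, self.W.t(), self.b_dec)  # x_hat = z W + b_dec, so W_dec = W_enc^T
        return x_hat, z

model = TiedLinearAE()
x = torch.rand(8, 784)
x_hat, z = model(x)
n_params = sum(p.numel() for p in model.parameters())
print(n_params)   # one weight matrix plus two bias vectors
```

Only the weight matrix is shared, so the saving is close to half rather than exactly half once biases are counted.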

Regularised variants.

  • Sparse AE: add an L1 penalty (or a KL penalty on activation means) so most latent units stay near zero per example. Each input then activates a different small subset, encouraging interpretable units.
  • Contractive AE: penalise the squared Frobenius norm of the encoder Jacobian, ‖∂z/∂x‖²_F. This makes the code locally insensitive to tiny input perturbations, so nearby inputs map to nearby codes.
  • Denoising AE: corrupt the input, reconstruct the clean version. Often the most useful in practice: no extra term in the loss, just data augmentation on the input side.
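Of the three variants, the contractive penalty is the least obvious to implement. One way to sketch it is with autograd, one Jacobian row at a time. This assumes the encoder treats examples independently (no batch norm), and the layer sizes and the weight `lam` are illustrative; `contractive_penalty` is a hypothetical helper name.

```python
import torch
import torch.nn as nn

def contractive_penalty(enc, x):
    """Squared Frobenius norm of dz/dx, averaged over the batch."""
    x = x.clone().requires_grad_(True)
    z = enc(x)
    penalty = x.new_zeros(())
    for j in range(z.shape[1]):
        # Row j of each example's Jacobian: gradient of z_j w.r.t. its own input.
        (g,) = torch.autograd.grad(z[:, j].sum(), x, create_graph=True)
        penalty = penalty + g.pow(2).sum(dim=1).mean()
    return penalty

torch.manual_seed(0)
enc = nn.Sequential(nn.Linear(10, 16), nn.Tanh(), nn.Linear(16, 3))
dec = nn.Linear(3, 10)
x = torch.randn(5, 10)
lam = 1e-2   # penalty weight (illustrative)
loss = ((dec(enc(x)) - x) ** 2).mean() + lam * contractive_penalty(enc, x)
loss.backward()
```

The loop costs one backward pass per latent dimension, so this form is only practical for small codes; single-layer encoders admit a cheaper closed form.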

The variational autoencoder. Make the encoder output a mean and a variance instead of a point, pull that distribution toward a unit-Gaussian prior with a KL term, and the bottleneck becomes a samplable latent space — an autoencoder that's also a generative model. That variational objective (the ELBO), the reparameterisation trick, and the sampling story have their own dedicated page: Variational Autoencoders.
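As a small taste of what that page covers, the mean-and-variance encoder and the KL pull toward a unit Gaussian can be sketched as below. This is an illustrative fragment only, assuming a diagonal Gaussian posterior parameterised by a log-variance; `GaussianEncoder`, `sample_code` and `kl_to_unit_gaussian` are hypothetical names.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    def __init__(self, d_in=784, d_latent=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU())
        self.mu = nn.Linear(256, d_latent)
        self.logvar = nn.Linear(256, d_latent)   # log-variance keeps the variance positive

    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.logvar(h)

def sample_code(mu, logvar):
    """Reparameterised sample: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def kl_to_unit_gaussian(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over the batch."""
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1)).mean()

torch.manual_seed(0)
enc = GaussianEncoder()
mu, logvar = enc(torch.rand(8, 784))
z = sample_code(mu, logvar)          # the "little cloud" instead of a point
kl = kl_to_unit_gaussian(mu, logvar)
```

The KL term is added to the reconstruction loss; it is zero exactly when the encoder outputs the prior itself.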

Reach for it when

  • Non-linear dimensionality reduction — a deeper cousin of PCA
  • Anomaly detection via reconstruction error
  • Representation learning where you can't get labels
  • Denoising or filling in corrupted inputs

Skip it when

  • You need to generate new samples — reach for the VAE, a GAN, or diffusion
  • You only need embeddings for transfer — contrastive methods (SimCLR, CLIP) are usually sharper
  • Linear structure is already enough — PCA is faster and closed-form
  • You need exact likelihoods — reach for a normalising flow

import torch, torch.nn as nn

class AE(nn.Module):
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(), nn.Linear(256, d_latent))
        self.dec = nn.Sequential(nn.Linear(d_latent, 256), nn.ReLU(), nn.Linear(256, d_in))
    def forward(self, x):
        return self.dec(self.enc(x))

# Denoising AE: corrupt the input, reconstruct the CLEAN target — no extra loss term.
def denoising_loss(model, x, noise=0.3):
    x_noisy = x + noise * torch.randn_like(x)
    return ((model(x_noisy) - x) ** 2).mean()

# Sparse AE: L1 penalty on the codes, so only a few latent units fire per example.
def sparse_loss(model, x, l1=1e-3):
    z = model.enc(x)
    return ((model.dec(z) - x) ** 2).mean() + l1 * z.abs().mean()

A linear autoencoder is PCA

$$ \min_{W_e,\,W_d} \; \mathbb{E}_{\mathbf{x}}\big\lVert \mathbf{x} - W_d W_e \mathbf{x} \big\rVert^2 \;\;\Longrightarrow\;\; \operatorname{span}(W_d) = \text{top-}k\text{ principal subspace} $$

  • W_e: linear encoder (k × d); W_d: linear decoder (d × k)
  • With MSE loss on mean-centred data, the optimal decoder's columns span exactly the top-k principal subspace
  • Non-linear layers make it strictly more expressive — "non-linear PCA"

$$ \text{best } \textit{linear} \text{ autoencoder} \;=\; \text{project onto the top-}k\text{ principal components} $$

In words. Strip out the non-linearities and an autoencoder has nowhere clever to go: a linear encoder and decoder trained on squared error can do no better than projecting onto the subspace spanned by the top-k principal components. In that sense a linear AE is PCA, up to an invertible linear re-mixing of the code: it recovers the principal subspace, not necessarily the individual principal axes. That's the anchor. Now add non-linear layers and the encoder can bend those axes to follow a curved data manifold, capturing structure a flat PCA plane can't. That's the whole reason to reach for an autoencoder over PCA: it's non-linear dimensionality reduction, with PCA as the degenerate linear case.
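The claim can be checked numerically without training anything. The sketch below, on synthetic centred data (an assumption), compares the reconstruction error of the PCA projection, of the same projection with its code re-mixed by an invertible matrix, and of a random k-dimensional subspace.

```python
import torch

torch.manual_seed(0)
n, d, k = 500, 10, 3

# Synthetic, centred data (assumption): correlated Gaussian features.
x = torch.randn(n, d, dtype=torch.float64) @ torch.randn(d, d, dtype=torch.float64)
x = x - x.mean(dim=0)

# PCA via SVD: rows of Vh are principal directions, sorted by variance.
_, s, Vh = torch.linalg.svd(x, full_matrices=False)
U_k = Vh[:k].T                                   # (d, k) top-k principal directions

def recon_error(W_e, W_d):
    """Mean squared reconstruction error of the linear AE x -> W_d W_e x."""
    return ((x - x @ W_e.T @ W_d.T) ** 2).sum(dim=1).mean()

# The PCA solution: encode with U_k^T, decode with U_k.
err_pca = recon_error(U_k.T, U_k)

# Re-mixing the code with an invertible A leaves the reconstruction unchanged,
# so the optimum pins down the subspace, not the individual axes.
A = torch.eye(k, dtype=torch.float64) + 0.3 * torch.randn(k, k, dtype=torch.float64)
err_mixed = recon_error(A @ U_k.T, U_k @ torch.linalg.inv(A))

# A random k-dimensional subspace does worse.
Q, _ = torch.linalg.qr(torch.randn(d, k, dtype=torch.float64))
err_random = recon_error(Q.T, Q)

print(err_pca.item(), err_mixed.item(), err_random.item())
```

The PCA error equals the variance in the discarded directions (the sum of the trailing squared singular values divided by n), which is the floor any rank-k linear autoencoder can reach.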

The variational family lives next door. The generative side of autoencoders — the ELBO, β-VAE and disentanglement, posterior collapse and its fixes (KL annealing, free bits) — is covered on the Variational Autoencoders page. What follows is the role plain and vector-quantised autoencoders play in modern systems.

VQ-VAE: discrete latents. Replace the Gaussian latent with the nearest entry in a learned codebook of vectors. The encoder outputs a continuous vector, you snap it to the closest codebook entry, and the decoder reconstructs from that. The discrete bottleneck sidesteps posterior collapse and produces tokens, which is why VQ-style tokenisers sit behind image generators such as DALL-E v1 (a discrete VAE) and Parti (ViT-VQGAN), and behind many multi-modal models that handle images or audio as token sequences.
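Snapping to the nearest code is not differentiable, so VQ-VAE training adds two auxiliary terms alongside the reconstruction loss: one that moves codebook entries toward the encoder outputs and one that keeps encoder outputs committed to their code. A minimal sketch, assuming mean-reduced squared errors and the commitment weight of 0.25 used in the original VQ-VAE paper; `vq_losses` is a hypothetical helper name and the tensors are random stand-ins.

```python
import torch
import torch.nn.functional as F

def vq_losses(z_e, codebook, beta=0.25):
    """Nearest-code lookup plus the two auxiliary VQ-VAE loss terms."""
    idx = torch.cdist(z_e, codebook).argmin(dim=1)   # nearest codebook entry per row
    z_q = codebook[idx]
    codebook_loss = F.mse_loss(z_q, z_e.detach())    # moves codes toward encoder outputs
    commit_loss = F.mse_loss(z_e, z_q.detach())      # keeps encoder outputs near their code
    z_st = z_e + (z_q - z_e).detach()                # straight-through copy for the decoder
    return z_st, idx, codebook_loss + beta * commit_loss

torch.manual_seed(0)
codebook = torch.randn(16, 4, requires_grad=True)    # K=16 codes of dimension D=4
z_e = torch.randn(8, 4, requires_grad=True)          # stand-in encoder outputs
z_st, idx, aux = vq_losses(z_e, codebook)
aux.backward()                                       # gradients reach both codebook and z_e
```

In a full model, `aux` is added to the reconstruction loss computed from the decoder's output on `z_st`, and `idx` is the token sequence handed to a downstream model.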

Latent diffusion: the modern role. Rombach et al. (2022), the work behind Stable Diffusion, split generation into two stages: first train a VAE-style autoencoder to compress images to a small latent grid, then train a diffusion model in that space rather than pixel space. The autoencoder handles the boring perceptual compression; the diffusion model handles the interesting semantic generation. This is probably the biggest production use of autoencoders today: many modern image / video generators run on a learned autoencoder's latents under the hood.

Where this leaves plain AEs. For high-fidelity generation, diffusion has clearly won. But autoencoders are alive and well: as the compression frontends for latent diffusion, as tokenisers for multi-modal models (VQ-VAE), for anomaly detection in industrial monitoring, for non-linear dimensionality reduction that beats PCA on complex manifolds, and as a clean teaching example of representation learning.

Compared to self-supervised pretraining. Modern self-supervised methods (SimCLR, DINO, MAE, CLIP) often beat plain AEs at producing useful embeddings for downstream tasks — they optimise for invariance and separability rather than pixel-level reconstruction. The Masked Autoencoder (He et al., 2022) is a notable bridge: it's structurally an AE with heavy masking, but trained at ViT scale it produces excellent transfer features. The line between "autoencoder" and "self-supervised method" is blurry once you start masking.
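The "heavy masking" step can be sketched as keeping a random subset of patches per image and handing only those to the encoder. This is a rough illustration, assuming inputs already split into patch embeddings and the 75% mask ratio reported in the MAE paper; `random_patch_mask` is a hypothetical helper name.

```python
import torch

def random_patch_mask(x, mask_ratio=0.75):
    """Keep a random subset of patches per example, MAE-style.

    x: (B, N, D) patch embeddings. Returns the kept patches, their indices,
    and a (B, N) boolean mask that is True where a patch was dropped.
    """
    B, N, D = x.shape
    n_keep = int(N * (1 - mask_ratio))
    order = torch.rand(B, N, device=x.device).argsort(dim=1)   # random permutation per example
    keep_idx = order[:, :n_keep]
    rows = torch.arange(B, device=x.device).unsqueeze(1)
    kept = x[rows, keep_idx]                                   # (B, n_keep, D)
    mask = torch.ones(B, N, dtype=torch.bool, device=x.device)
    mask[rows, keep_idx] = False
    return kept, keep_idx, mask

torch.manual_seed(0)
patches = torch.randn(2, 16, 8)          # 2 images, 16 patches each, 8 dims per patch
kept, keep_idx, mask = random_patch_mask(patches)
print(kept.shape, mask.sum(dim=1))
```

The encoder sees only `kept`; the decoder is then asked to reconstruct the patches where `mask` is True, which is the denoising-autoencoder recipe with masking as the corruption.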

Reach for it when

  • Latent diffusion — almost every modern image / video generator needs a VAE frontend
  • VQ-VAE — tokeniser for multi-modal models, discrete codes for downstream LMs
  • Non-linear dimensionality reduction — a curved-manifold upgrade on PCA
  • Anomaly detection at scale — reconstruction error is cheap and unsupervised
  • Masked AE pretraining — strong transfer features when you have a ViT and a lot of data

Skip it when

  • Sample sharpness is the only thing you care about — use diffusion directly in pixel space
  • You just need good embeddings — contrastive (CLIP, SimCLR, DINO) usually wins
  • You need exact likelihoods — use a normalising flow instead
  • Your data is small and tabular, where PCA or a tiny denoising AE is plenty

import torch

# VQ-VAE quantiser: snap each encoder output to its nearest codebook vector,
# turning a continuous bottleneck into discrete tokens.
def quantise(z_e, codebook):                 # z_e: (B, D), codebook: (K, D)
    d = (z_e.pow(2).sum(1, keepdim=True)
         - 2 * z_e @ codebook.t()
         + codebook.pow(2).sum(1))           # squared distance to every code
    idx = d.argmin(1)                        # index of the nearest codebook entry
    z_q = codebook[idx]                      # the quantised code
    # straight-through estimator: forward uses z_q, but gradients flow to z_e
    return z_e + (z_q - z_e).detach(), idx