Autoencoders — Layerwise ML

Key idea

Squeeze data through a narrow channel, then rebuild it. An encoder compresses the input to a small code; a decoder tries to reconstruct the original from that code alone. The bottleneck has nowhere to hide redundancy — only the features that actually help reconstruction survive.

[Interactive figure: encoder · 2-D latent · decoder. Drag the dot to decode any code.]

Why the bottleneck matters

If the code were as wide as the input, the network could just copy values through and learn nothing — the trivial identity function. The bottleneck makes that impossible: only a fixed number of dimensions get to pass, so the encoder has to choose what's worth keeping. The decoder, working from those few numbers alone, becomes a reconstruction prior — it fills in what it has learned tends to be there. The two halves trained jointly converge on a compact summary of the data's structure.

What the latent space encodes

The 2-D latent in the figure is doing a job a lot like PCA, but non-linear. Each preset lands at its own spot in the plane; points between presets decode to blends. The factors that vary most across the training set tend to show up as directions in latent space, although nothing forces them to line up with the axes. Walk along such a direction and the reconstruction morphs smoothly, without you ever telling the network what "morph" means.
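
The blending behaviour described above can be sketched as linear interpolation between two latent codes. This is a minimal sketch rather than the figure's actual implementation: `enc` and `dec` are untrained stand-ins, the 144-dimensional input assumes a flattened 12×12 grid, and the helper name `interpolate` is hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for a trained encoder/decoder with a 2-D latent.
enc = nn.Sequential(nn.Linear(144, 32), nn.ReLU(), nn.Linear(32, 2))
dec = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 144))

def interpolate(x_a, x_b, steps=5):
    """Decode evenly spaced points on the line between the codes of x_a and x_b."""
    with torch.no_grad():
        z_a, z_b = enc(x_a), enc(x_b)                      # each: (2,)
        t = torch.linspace(0.0, 1.0, steps).unsqueeze(1)   # (steps, 1)
        z_path = (1 - t) * z_a + t * z_b                   # (steps, 2)
        return z_path, dec(z_path)                         # codes and decoded blends

x_a, x_b = torch.rand(144), torch.rand(144)   # placeholder inputs
z_path, blends = interpolate(x_a, x_b)
```

With a trained model, the rows of `blends` would run from a reconstruction of `x_a` to a reconstruction of `x_b`, with mixtures in between.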

Why this is unsupervised

There are no labels. The target is the input — every example supervises its own reconstruction. That's the appeal: any pile of unlabelled data is enough to learn an embedding you can then plug into a classifier, a clustering algorithm, or an anomaly detector.

Denoising autoencoders

A small twist with outsized benefits: corrupt the input (add noise, mask pixels) and train the network to recover the clean original. The model can no longer get by with copying its input, because the corruption changes every step; it has to learn what should be there. Masked-token pretraining in language models rests on the same idea.

Turning it generative: the VAE

A plain AE's latent space has holes — sample a random code and the decoder usually returns garbage, because nothing trained it on that point. The variational autoencoder fixes that: the encoder outputs a distribution over codes and every one is pulled toward a shared prior, so the whole space becomes samplable. That turns an autoencoder into a proper generative model — and it gets its own treatment (the ELBO, the reparameterisation trick, why it's a lower bound, sampling) on the Variational Autoencoders page. The VAE mode toggle in the figure above is a taste of it: the code becomes a little cloud rather than a point.

Reach for it when

Learning representations from a pile of unlabelled data
Non-linear dimensionality reduction (a deeper cousin of PCA)
Anomaly detection — large reconstruction error flags an outlier
Denoising or inpainting where the corruption pattern is known
Pretraining an embedding from unlabelled data to feed a classifier or clusterer

Skip it when

You have labels — supervised pretraining usually gives sharper embeddings
You want a model you can sample from — reach for the VAE, a GAN, or diffusion
Linear methods (PCA) are already good enough
Best embeddings for transfer are the goal — a contrastive method (SimCLR, CLIP) usually wins

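import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        # Encoder: d_in -> 256 -> d_latent (the bottleneck)
        self.enc = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(), nn.Linear(256, d_latent))
        # Decoder mirrors the encoder: d_latent -> 256 -> d_in
        self.dec = nn.Sequential(nn.Linear(d_latent, 256), nn.ReLU(), nn.Linear(256, d_in))

    def forward(self, x):
        z = self.enc(x)            # latent code
        return self.dec(z), z      # reconstruction and code

model = AE()
x = torch.rand(64, 784)            # placeholder batch; swap in real flattened inputs

# Loss: just reconstruction error
x_hat, z = model(x)
loss = ((x_hat - x) ** 2).mean()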
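
Since the loss is just reconstruction error, the same quantity computed per example gives the anomaly score from the "Reach for it when" list above. A minimal sketch, assuming a model already trained on normal data; here an untrained stand-in is used, and the helper name `anomaly_scores` and the 99th-percentile threshold are illustrative assumptions rather than recommendations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Untrained stand-in; in practice this would be an autoencoder fitted on normal data.
model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 784))

def anomaly_scores(model, x):
    """Per-example mean squared reconstruction error; higher means more unusual."""
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1)

x_normal = torch.rand(256, 784)    # placeholder for held-out normal data
x_new = torch.rand(8, 784)         # placeholder for incoming examples

# Hypothetical rule: flag anything above the 99th percentile of normal-data scores.
threshold = torch.quantile(anomaly_scores(model, x_normal), 0.99)
flags = anomaly_scores(model, x_new) > threshold
```

The threshold is the part that needs care in practice: it trades false alarms against missed outliers and has to be chosen on data known to be normal.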

The reconstruction objective

$$ \min_{\theta,\phi} \;\; \mathbb{E}_{\mathbf{x}}\big\lVert \mathbf{x} - g_\theta\!\big(f_\phi(\mathbf{x})\big) \big\rVert^2, \qquad f_\phi:\mathbb{R}^d \to \mathbb{R}^k,\;\; k \ll d $$

f_φ: the encoder, which maps the input down to a k-dimensional code
g_θ: the decoder, which maps the code back up to the input space
k ≪ d: the bottleneck, with far fewer code dims than input dims
No labels — the target is the input (self-supervised)

$$ \text{minimise the average } \big\lVert\, \text{input} - \text{decode}(\text{encode}(\text{input}))\, \big\rVert^2 $$

In words. An autoencoder is trained on a single job: reproduce its own input after squeezing it through a narrow, k-dimensional bottleneck. There's no label and no other term — just the gap between the input and its reconstruction. Because the code has far fewer numbers than the input (k ≪ d), the encoder can't keep everything; it's forced to spend its few dimensions on whatever matters most for rebuilding the data, and the decoder learns to fill in the rest. Everything else on this page — denoising, sparsity, the variational twist — is this objective plus one extra term or corruption.

Reconstruction loss: pick by data type. Use MSE (squared error) for continuous outputs like image pixels in [0,1] or audio amplitudes; use binary cross-entropy per pixel when the decoder produces Bernoulli probabilities (the original VAE paper's choice for MNIST). For discrete data such as text tokens or VQ codes, use ordinary cross-entropy. The loss implicitly encodes a noise model on the output: MSE assumes Gaussian noise around the reconstruction, BCE assumes Bernoulli. A mismatch shows up in the outputs. The best-known case is MSE on data with several plausible reconstructions, where the optimum is their average and the result looks blurry.
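
The pairing of loss and data type can be sketched as follows. This is an illustrative sketch only: random tensors stand in for decoder outputs, and the shapes and vocabulary size are assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B = 4

# Continuous targets (e.g. pixel intensities): MSE, i.e. a Gaussian noise model.
x_cont = torch.rand(B, 784)
out_cont = torch.rand(B, 784)                  # stand-in decoder output
mse = F.mse_loss(out_cont, x_cont)

# Targets in [0, 1] treated as Bernoulli means: BCE on raw decoder logits.
# binary_cross_entropy_with_logits applies the sigmoid internally, which is
# numerically safer than a sigmoid followed by binary_cross_entropy.
x_bin = torch.rand(B, 784)
logits_bin = torch.randn(B, 784)               # stand-in decoder logits
bce = F.binary_cross_entropy_with_logits(logits_bin, x_bin)

# Discrete targets (token ids): cross-entropy over a vocabulary.
vocab, seq = 50, 10                            # hypothetical sizes
tokens = torch.randint(0, vocab, (B, seq))
logits_tok = torch.randn(B, seq, vocab)        # stand-in decoder logits
ce = F.cross_entropy(logits_tok.reshape(-1, vocab), tokens.reshape(-1))
```

Note that in the BCE and cross-entropy cases the decoder's last layer emits logits, so no final sigmoid or softmax belongs in the model itself.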

Undercomplete vs. overcomplete. Undercomplete means the latent is smaller than the input — the bottleneck does the regularising for you. Overcomplete (latent bigger than input) needs help to avoid copying the input straight through: sparsity penalties, denoising corruption, or contractive penalties on the encoder Jacobian. Without one of those, an overcomplete AE happily learns the identity and tells you nothing.

Tied weights. A classical trick: force the decoder weights to be the transpose of the encoder weights (W_dec = W_enc^T). Halves the parameter count and gives a clean analogy to PCA — for a linear AE with MSE loss and tied weights, the optimal solution literally spans the top principal components. Modern deep AEs usually don't bother, but it's a clean inductive bias for small models.
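
A minimal sketch of weight tying for a single-layer autoencoder, assuming the decoder keeps its own bias. The class name `TiedAE` and the sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedAE(nn.Module):
    """Single-layer autoencoder whose decoder reuses the encoder weight, transposed."""
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.enc = nn.Linear(d_in, d_latent)             # weight shape: (d_latent, d_in)
        self.dec_bias = nn.Parameter(torch.zeros(d_in))  # decoder keeps its own bias

    def forward(self, x):
        z = self.enc(x)
        # F.linear(z, W, b) computes z @ W.T + b for W of shape (out, in), so
        # passing enc.weight.t() (shape (d_in, d_latent)) applies W_dec = W_enc^T.
        return F.linear(z, self.enc.weight.t(), self.dec_bias)

model = TiedAE()
x = torch.rand(16, 784)    # placeholder batch
x_hat = model(x)
n_params = sum(p.numel() for p in model.parameters())
```

Only one weight matrix is registered as a parameter, so the gradient from both the encoding and decoding paths accumulates into the same tensor.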

Regularised variants.

Sparse AE: add an L1 penalty (or a KL penalty on activation means) so most latent units stay near zero per example. Each input then activates a different small subset, encouraging interpretable units.
Contractive AE: penalise the squared Frobenius norm of the encoder Jacobian, ‖∂z/∂x‖²_F. This makes the code locally insensitive to tiny input perturbations, while the reconstruction term keeps it sensitive along the directions in which the data actually varies.
Denoising AE: corrupt the input, reconstruct the clean version. Practically the most useful, since it needs no extra term in the loss, just data augmentation on the input side.
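
Of the three variants, the contractive penalty is the least obvious to write down. For an encoder that is a single sigmoid layer, z = sigmoid(Wx + b), the Jacobian norm has a closed form, sketched below. The layer sizes and the helper name `contractive_penalty` are assumptions, and the closed form does not carry over to deeper encoders, which would need autograd to compute the Jacobian.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

enc = nn.Linear(20, 5)     # hypothetical sizes: d_in = 20, k = 5

def contractive_penalty(enc, x):
    """Squared Frobenius norm of dz/dx for z = sigmoid(W x + b), averaged over the batch."""
    z = torch.sigmoid(enc(x))                  # (B, k)
    dz_sq = (z * (1 - z)) ** 2                 # squared sigmoid derivative, (B, k)
    w_sq = enc.weight.pow(2).sum(dim=1)        # squared row norms of W, (k,)
    # J[j, i] = z_j (1 - z_j) W[j, i], so ||J||_F^2 = sum_j dz_sq[j] * w_sq[j]
    return (dz_sq * w_sq).sum(dim=1).mean()

x = torch.rand(8, 20)      # placeholder batch
penalty = contractive_penalty(enc, x)
```

In training, this term would be added to the reconstruction loss with a small weight, in the same way as the L1 term of a sparse AE.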

The variational autoencoder. Make the encoder output a mean and a variance instead of a point, pull that distribution toward a unit-Gaussian prior with a KL term, and the bottleneck becomes a samplable latent space — an autoencoder that's also a generative model. That variational objective (the ELBO), the reparameterisation trick, and the sampling story have their own dedicated page: Variational Autoencoders.
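
The two ingredients named above, a sampled code and a KL pull toward the unit-Gaussian prior, can be sketched in a few lines. This is a minimal sketch assuming a diagonal-Gaussian encoder that outputs `mu` and `logvar` (log-variance); the function names are hypothetical, and the full objective is left to the Variational Autoencoders page.

```python
import torch

def reparameterise(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping gradients to mu and logvar."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def kl_to_unit_gaussian(mu, logvar):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dims, averaged over the batch."""
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp())).sum(dim=1).mean()

mu = torch.zeros(4, 2)        # stand-in encoder outputs for a 2-D latent
logvar = torch.zeros(4, 2)
z = reparameterise(mu, logvar)
kl = kl_to_unit_gaussian(mu, logvar)   # zero here: this posterior already equals the prior
```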

Reach for it when

Non-linear dimensionality reduction — a deeper cousin of PCA
Anomaly detection via reconstruction error
Representation learning where you can't get labels
Denoising or filling in corrupted inputs

Skip it when

You need to generate new samples — reach for the VAE, a GAN, or diffusion
You only need embeddings for transfer — contrastive methods (SimCLR, CLIP) are usually sharper
Linear structure is already enough — PCA is faster and closed-form
You need exact likelihoods — reach for a normalising flow

import torch, torch.nn as nn

class AE(nn.Module):
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(), nn.Linear(256, d_latent))
        self.dec = nn.Sequential(nn.Linear(d_latent, 256), nn.ReLU(), nn.Linear(256, d_in))
    def forward(self, x):
        return self.dec(self.enc(x))

# Denoising AE: corrupt the input, reconstruct the CLEAN target — no extra loss term.
def denoising_loss(model, x, noise=0.3):
    x_noisy = x + noise * torch.randn_like(x)
    return ((model(x_noisy) - x) ** 2).mean()

# Sparse AE: L1 penalty on the codes, so only a few latent units fire per example.
def sparse_loss(model, x, l1=1e-3):
    z = model.enc(x)
    return ((model.dec(z) - x) ** 2).mean() + l1 * z.abs().mean()

A linear autoencoder is PCA

$$ \min_{W_e,\,W_d} \; \mathbb{E}_{\mathbf{x}}\big\lVert \mathbf{x} - W_d W_e \mathbf{x} \big\rVert^2 \;\;\Longrightarrow\;\; \operatorname{span}(W_d) = \text{top-}k\text{ principal subspace} $$

W_e: linear encoder (k × d); W_d: linear decoder (d × k)
With MSE loss, the optimum spans exactly the top-k PCA directions
Non-linear layers make it strictly more expressive — "non-linear PCA"

$$ \text{best \emph{linear} autoencoder} \;=\; \text{project onto the top-}k\text{ principal components} $$

In words. Strip out the non-linearities and an autoencoder has nowhere clever to go: a linear encoder and decoder trained on squared error (with centred data) can do no better than projecting onto the top-k principal components. A linear AE literally is PCA, up to an invertible linear re-mixing of the code, of which a rotation is the simplest case. That's the anchor. Now add non-linear layers and the encoder can bend those axes to follow a curved data manifold, capturing structure a straight PCA plane can't. That's the whole reason to reach for an autoencoder over PCA: it's non-linear dimensionality reduction, with PCA as the degenerate linear case.
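
The claim can be checked numerically without training anything. The sketch below, on synthetic centred data with assumed sizes, builds the PCA encoder and decoder from an SVD and then shows that re-mixing the code with an invertible matrix (rotations included) leaves the reconstruction unchanged, so only the spanned subspace is pinned down.

```python
import torch

torch.manual_seed(0)

# Synthetic data, centred so that PCA and the linear AE solve the same problem.
X = torch.randn(500, 10) @ torch.randn(10, 10)
X = X - X.mean(dim=0)
k = 3

# PCA via SVD: the rows of Vh are the principal directions.
_, _, Vh = torch.linalg.svd(X, full_matrices=False)
W_e = Vh[:k]                   # encoder, (k, d)
W_d = Vh[:k].t()               # decoder, (d, k)
recon_pca = X @ W_e.t() @ W_d.t()

# Re-mix the code with a fixed invertible matrix C (hypothetical values, det = 7).
C = torch.tensor([[2.0, 1.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 0.0, 3.0]])
W_e_mixed = C @ W_e
W_d_mixed = W_d @ torch.linalg.inv(C)
recon_mixed = X @ W_e_mixed.t() @ W_d_mixed.t()

err_pca = (X - recon_pca).pow(2).mean()
err_mixed = (X - recon_mixed).pow(2).mean()   # matches err_pca up to float error
```

A linear AE trained by gradient descent would typically land on some such re-mixed solution rather than on the ordered, orthogonal principal axes themselves.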

The variational family lives next door. The generative side of autoencoders — the ELBO, β-VAE and disentanglement, posterior collapse and its fixes (KL annealing, free bits) — is covered on the Variational Autoencoders page. What follows is the role plain and vector-quantised autoencoders play in modern systems.

VQ-VAE: discrete latents. Replace the Gaussian latent with the nearest entry in a learned codebook of vectors. The encoder outputs a continuous vector, you snap it to the closest codebook entry, and the decoder reconstructs from that. The discrete bottleneck sidesteps posterior collapse and produces tokens, which is why VQ-style tokenisers sit behind DALL-E v1 (which uses a closely related discrete VAE), Parti (ViT-VQGAN), and many multi-modal models that generate images or audio.

Latent diffusion: the modern role. Rombach et al. (2022), the paper behind Stable Diffusion, split generation into two stages: first train a VAE to compress images to a small latent grid, then train a diffusion model in that space rather than pixel space. The VAE handles the boring perceptual compression; the diffusion model handles the interesting semantic generation. This is by far the biggest production use of autoencoders today: most modern latent image and video generators have a learned VAE encoder under the hood.

Where this leaves plain AEs. For high-fidelity generation, diffusion has clearly won. But autoencoders are alive and well: as the compression frontends for latent diffusion, as tokenisers for multi-modal models (VQ-VAE), for anomaly detection in industrial monitoring, for non-linear dimensionality reduction that beats PCA on complex manifolds, and as a clean teaching example of representation learning.

Compared to self-supervised pretraining. Modern self-supervised methods (SimCLR, DINO, MAE, CLIP) often beat plain AEs at producing useful embeddings for downstream tasks — they optimise for invariance and separability rather than pixel-level reconstruction. The Masked Autoencoder (He et al., 2022) is a notable bridge: it's structurally an AE with heavy masking, but trained at ViT scale it produces excellent transfer features. The line between "autoencoder" and "self-supervised method" is blurry once you start masking.

Reach for it when

Latent diffusion — almost every modern image / video generator needs a VAE frontend
VQ-VAE — tokeniser for multi-modal models, discrete codes for downstream LMs
Non-linear dimensionality reduction — a curved-manifold upgrade on PCA
Anomaly detection at scale — reconstruction error is cheap and unsupervised
Masked AE pretraining — strong transfer features when you have a ViT and a lot of data

Skip it when

Sample sharpness is the only thing you care about — use diffusion directly in pixel space
You just need good embeddings — contrastive (CLIP, SimCLR, DINO) usually wins
You need exact likelihoods — use a normalising flow instead
Your data is small and tabular, so PCA or a tiny denoising AE is plenty

import torch

# VQ-VAE quantiser: snap each encoder output to its nearest codebook vector,
# turning a continuous bottleneck into discrete tokens.
def quantise(z_e, codebook):                 # z_e: (B, D), codebook: (K, D)
    d = (z_e.pow(2).sum(1, keepdim=True)
         - 2 * z_e @ codebook.t()
         + codebook.pow(2).sum(1))           # squared distance to every code
    idx = d.argmin(1)                        # index of the nearest codebook entry
    z_q = codebook[idx]                      # the quantised code
    # straight-through estimator: forward uses z_q, but gradients flow to z_e
    return z_e + (z_q - z_e).detach(), idx
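
The straight-through estimator sends gradients to the encoder but none to the codebook, so VQ-VAE training adds two extra terms. The sketch below follows the codebook and commitment terms of van den Oord et al. (2017) with their β = 0.25. It uses mean squared error rather than the paper's summed squared norm, which only changes the scale, and the helper name `vq_losses` and the tensor sizes are assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def vq_losses(z_e, codebook, beta=0.25):
    """Codebook term plus beta-weighted commitment term for nearest-code quantisation."""
    idx = torch.cdist(z_e, codebook).argmin(dim=1)    # nearest code per row
    z_q = codebook[idx]
    codebook_loss = F.mse_loss(z_q, z_e.detach())     # moves codes toward encoder outputs
    commit_loss = F.mse_loss(z_e, z_q.detach())       # keeps encoder outputs near their code
    return codebook_loss + beta * commit_loss

z_e = torch.randn(8, 4, requires_grad=True)           # stand-in encoder outputs
codebook = torch.randn(16, 4, requires_grad=True)     # stand-in codebook
loss = vq_losses(z_e, codebook)
loss.backward()
```

The full training loss would be the reconstruction error plus this value; the `detach()` calls play the role of the stop-gradient operator in the paper.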

Inside the figure

The left grid is the 12×12 input. The orange dot in the centre panel is the latent code — two numbers that together summarise the entire image. The right grid is what the decoder reconstructs from those two numbers alone.

Click a preset and the encoder lands the dot at a learned position. Drag the dot anywhere and the decoder will try to reconstruct something — points near a preset look like that preset, points between presets give blends, and points far from any preset decode to garbage.

Toggle VAE mode and the encoder starts outputting a small cloud instead of a single point: each forward pass samples a slightly different code. That stochasticity is exactly what lets the trained VAE generate plausible new samples — the decoder has learned to handle a whole neighbourhood, not just one point.

Where to learn more

Zhang et al., Dive into Deep Learning (autoencoders). Interactive, code-first textbook. The generative-models chapters walk through plain AEs, denoising AEs, and the family with runnable PyTorch.
Lilian Weng, From Autoencoder to Beta-VAE. A single tour through AE, denoising AE, sparse AE, contractive AE, VAE, and β-VAE. The clearest single source on the whole family.
Kingma & Welling, An Introduction to Variational Autoencoders. The original VAE authors' tutorial and the deep external reference for the variational cousin (ELBO, reparameterisation, amortised inference). Start with our Variational Autoencoders page for the intuition.
van den Oord et al. (2017), VQ-VAE. Discrete-codebook latents, the backbone of modern multi-modal tokenisers (DALL-E v1, Parti, audio LMs).
Rombach et al. (2022), High-Resolution Image Synthesis with Latent Diffusion Models. Stable Diffusion's paper. Shows how a VAE frontend lets diffusion scale to high-resolution generation cheaply, the dominant production use of autoencoders today.