
Key idea

Stack simple operations until something complex emerges. Like neurons in the brain, each layer does something simple — a weighted sum of its inputs, then a nonlinear bend. Stack enough of these and you can approximate just about any function: image classification, language translation, game playing.

Task: Predict add or skip for 4 songs from tempo and mood (centred around 0). Ŷ is the prediction, target is the truth, and ✓ marks a match. The targets are XOR-like: toggle the activation to none and watch every prediction collapse.
[Interactive figure: a small 2-layer network with 3 hidden neurons. Weights can be dragged, the activation can be switched, and each column opens a view inside one neuron.]

Why we need activation functions

Try toggling activation to none in the figure. Every prediction collapses — and that's the point. Without a nonlinearity between layers, two matrix multiplications compose into a single matrix multiplication: W₂ · (W₁ · X) = (W₂ · W₁) · X. Stack 100 layers of pure matmuls and you still have exactly the expressive power of one. The activation (ReLU, sigmoid, GELU, …) breaks that collapse — each layer can bend the representation in a new way, and only then does deep mean anything.
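The collapse can be checked numerically. The sketch below is a minimal NumPy illustration, not the figure's actual network: the input values, layer sizes, and random weights are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four made-up songs as columns: (tempo, mood), centred around 0.
X = np.array([[-1.0, -1.0,  1.0, 1.0],
              [-1.0,  1.0, -1.0, 1.0]])
W1 = rng.normal(size=(3, 2))   # layer 1: 2 inputs -> 3 hidden units
W2 = rng.normal(size=(1, 3))   # layer 2: 3 hidden units -> 1 output

def relu(z):
    return np.maximum(0.0, z)

# No activation: two layers equal a single layer with the merged matrix W2 @ W1.
two_layers = W2 @ (W1 @ X)
one_layer = (W2 @ W1) @ X
linear_collapses = np.allclose(two_layers, one_layer)

# ReLU in between: the result is no longer a single matrix product.
with_relu = W2 @ relu(W1 @ X)
relu_differs = not np.allclose(with_relu, one_layer)

print(linear_collapses, relu_differs)
```

Swapping `relu` for any other nonlinearity should give the same qualitative result; only the purely linear stack merges into one matrix.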

How to choose an architecture

Two main knobs: width (neurons per layer) and depth (how many layers).

  • Wider layers carry more parallel features at once — useful when many independent things matter.
  • Deeper stacks build features compositionally — later layers reuse what earlier ones discovered.

For tabular problems, a few hidden layers of 64–256 units is usually plenty. For images, sequences, or graphs, the architecture itself encodes the data's structure — reach for a CNN, RNN, transformer, or GNN. In practice the hardest part is rarely depth or width; it is getting the input scale, regularisation, and learning rate right.

Why going deep works

Each layer can learn an abstraction built from the previous layer's features. A common (simplified) picture of an image classifier: layer 1 picks up edges, layer 2 corners, layer 3 textures, layer 4 object parts, layer 5 whole objects. In language: characters → words → phrases → meaning. The universal approximation theorem (switch to the In-depth tier for the formal version) says one sufficiently wide hidden layer can in principle approximate any continuous function. In practice, though, deep narrow networks often generalise better than shallow wide ones with the same parameter count, because hierarchical composition lets the network reuse intermediate features instead of memorising every input combination separately.

Training: how the weights got there

The weights in the figure didn't appear by magic — they were learned. You define a loss (how wrong the predictions are), compute the gradient of the loss with respect to every weight using the chain rule (this procedure is called backpropagation), and nudge each weight a tiny step in the direction that lowers the loss. Repeat for millions of steps. The figure freezes one moment in that process; training is what carved the colours you see.
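The loop described above can be sketched on the smallest possible "network": a single weight. This is an illustrative toy, assuming PyTorch's autograd for the chain rule; the data, learning rate, and step count are made up for the example.

```python
import torch

# Toy data (assumed for illustration): the true relationship is y = 3 * x.
x = torch.tensor([1.0, 2.0, 3.0])
y = 3.0 * x

w = torch.tensor(0.0, requires_grad=True)   # a single learnable weight
lr = 0.05                                   # size of each nudge (learning rate)

losses = []
for step in range(100):
    loss = ((w * x - y) ** 2).mean()   # how wrong the predictions are
    loss.backward()                    # backpropagation: fills w.grad
    with torch.no_grad():
        w -= lr * w.grad               # step in the direction that lowers the loss
    w.grad.zero_()                     # reset the gradient before the next step
    losses.append(loss.item())

print(round(w.item(), 3), losses[0], losses[-1])
```

A real network repeats exactly this update for every weight at once, with an optimizer object handling the bookkeeping.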

→ Full mechanics in Gradient Descent and Optimisation Algorithms.

Reach for it when

  • Tabular data and you want more flexibility than a linear model
  • You have plenty of data and compute
  • You'll use this as a building block in a bigger model
  • You want a starting point before reaching for CNNs / transformers

Skip it when

  • Small tabular data — gradient boosting usually wins
  • Data has obvious structure (images, sequences, graphs) — use a specialized architecture
  • You need interpretability of individual predictions
  • You're short on training data

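import torch.nn as nn

n_features, n_classes = 20, 3   # example sizes (assumed); set these to match your data

model = nn.Sequential(
    nn.Linear(n_features, 128), nn.ReLU(),
    nn.Linear(128, 64),          nn.ReLU(),
    nn.Linear(64, n_classes),   # final layer returns raw class scores (logits)
)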
Want the forward / backward math?

Forward pass

$$ \mathbf{h}^{(\ell)} \;=\; \sigma\!\left(W^{(\ell)} \mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)}\right) $$

  • h(ℓ): activations (outputs) of layer ℓ, with h(0) = x
  • W(ℓ), b(ℓ): weight matrix and bias vector of layer ℓ
  • σ: elementwise nonlinearity (ReLU, GELU, tanh, …)
  • Apply this rule from ℓ = 1 up to the final layer to get the network's output

$$ \text{layer output} \;=\; \sigma\!\left(\text{weights} \times \text{previous layer output} + \text{bias}\right) $$

In words. Each layer does three things in order: multiply the previous layer's output (a list of numbers) by a matrix of weights, add a bias vector to shift the result, then run each number through a nonlinearity σ (sigma; usually ReLU, which is just max(0, x)). You repeat this for every layer, plugging each layer's output into the next. The superscript (ℓ) in the math version is just a layer label ("weights for layer 1", "weights for layer 2", and so on). Each layer has its own matrix, and all of them are learned together during training.
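The rule can be written out in a few lines of NumPy. This is a minimal sketch with hand-picked weights (an assumption made for the example); it also leaves the final layer linear, as the PyTorch snippets on this page do, rather than applying σ to the output.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers):
    """Apply h = relu(W @ h + b) layer by layer; the last layer is left linear."""
    h = x
    for i, (W, b) in enumerate(layers):
        z = W @ h + b                       # weights times previous output, plus bias
        is_last = i == len(layers) - 1
        h = z if is_last else relu(z)       # nonlinearity on hidden layers only
    return h

# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    (np.array([[1.0, -1.0], [-1.0, 1.0]]), np.array([0.0, 0.0])),
    (np.array([[1.0, 1.0]]), np.array([0.0])),
]

out = forward(np.array([2.0, 0.5]), layers)
print(out)  # [1.5]
```

With these particular weights the network computes |x₁ − x₂|, a function that no single linear layer can represent.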

Training. Define a loss ℒ(θ), take its gradient with respect to all parameters via backpropagation (chain rule applied layer-by-layer), and step in the negative-gradient direction. SGD, Adam, and AdamW are the optimizers people actually use.

Activations. ReLU is the default: it is cheap and avoids the saturating-gradient problem of tanh / sigmoid. GELU is smoother and dominates in transformers. Sigmoid / tanh now appear mainly inside gates (LSTM, GRU) and at output layers where bounded outputs matter.
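The saturation claim is easy to inspect with autograd. The sketch below evaluates each activation's derivative at three arbitrary inputs (the input values are assumptions chosen to show the effect).

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-10.0, 0.5, 10.0], requires_grad=True)

def grad_of(fn):
    """Return the elementwise derivative d fn(x) / dx at the points in x."""
    (g,) = torch.autograd.grad(fn(x).sum(), x)
    return g

g_sigmoid = grad_of(torch.sigmoid)   # nearly zero at both extremes (saturation)
g_tanh = grad_of(torch.tanh)         # nearly zero at both extremes (saturation)
g_relu = grad_of(F.relu)             # exactly 0 for negative inputs, 1 for positive
g_gelu = grad_of(F.gelu)             # close to ReLU far from zero, smooth near zero

for name, g in [("sigmoid", g_sigmoid), ("tanh", g_tanh), ("relu", g_relu), ("gelu", g_gelu)]:
    print(name, g.tolist())
```

For large positive inputs the sigmoid and tanh gradients are tiny while the ReLU and GELU gradients stay near 1, which is the property that keeps deep stacks trainable.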

Initialization. Random Gaussian weights need to be scaled carefully — too big and activations blow up, too small and they vanish. Xavier/Glorot for tanh, He/Kaiming for ReLU. Most frameworks default to a sensible scheme.
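The effect of the weight scale shows up after only a few layers. The following sketch pushes random inputs through a stack of ReLU layers under three initializations; the depth, width, and the two "bad" standard deviations are arbitrary assumptions for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def final_activation_std(init_fn, depth=20, width=256):
    """Push random inputs through `depth` ReLU layers and return the output std."""
    h = torch.randn(1024, width)
    with torch.no_grad():
        for _ in range(depth):
            layer = nn.Linear(width, width, bias=False)
            init_fn(layer.weight)            # overwrite the default initialization
            h = torch.relu(layer(h))
    return h.std().item()

std_small = final_activation_std(lambda w: nn.init.normal_(w, std=0.01))   # too small
std_he = final_activation_std(lambda w: nn.init.kaiming_normal_(w, nonlinearity="relu"))
std_large = final_activation_std(lambda w: nn.init.normal_(w, std=0.5))    # too big

print(std_small, std_he, std_large)
```

With He/Kaiming scaling the activations should stay at a scale of order 1, while the other two settings shrink toward zero or grow by many orders of magnitude.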

Regularization. A network with enough parameters can memorize the training set perfectly — and then fail on new data. Regularization is anything that pushes the model away from rote memorization and toward solutions that generalize. The standard toolkit:

  • Dropout — randomly zero a fraction of activations during each training step so no single neuron becomes indispensable; roughly equivalent to averaging an ensemble of thinned subnetworks.
  • Weight decay (L2) — add a penalty proportional to ‖W‖² to the loss; shrinks weights toward zero, which limits how sharply the function can bend.
  • Early stopping — hold out a validation set and stop training once validation loss has stopped improving for a few evaluations in a row, rather than at the first noisy uptick.
  • Data augmentation — perturb each input (flip, crop, noise, paraphrase) so the model has to learn invariant features rather than memorize examples.

Modern nets often need less explicit regularization than older ones — the implicit regularization of SGD itself (its preference for flat, broad minima) does a surprising amount of the work.
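Of the tools listed above, dropout is the one whose behaviour differs between training and evaluation. A small sketch with `nn.Dropout` makes this visible; the input of all ones and p = 0.5 are assumptions chosen so the numbers are easy to read.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(10_000)

drop.train()           # training mode: zero random units, scale survivors by 1 / (1 - p)
y_train = drop(x)

drop.eval()            # evaluation mode: dropout passes inputs through unchanged
y_eval = drop(x)

kept_fraction = (y_train != 0).float().mean().item()   # should be close to 1 - p
kept_value = y_train.max().item()                      # survivors become 1 / (1 - p)

print(kept_fraction, kept_value, torch.equal(y_eval, x))
```

This is why training code calls `model.train()` before the loop and `model.eval()` before validation or inference.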

Reach for it when

  • The data has no obvious structural prior — let a deep MLP discover features
  • You can apply the modern training recipe — ReLU/GELU, He init, Adam, dropout, weight decay
  • You're stacking many layers — residual connections + LayerNorm let you go deep without vanishing gradients
  • You'll embed it as a head or block inside a larger architecture (CNN, transformer, …)

Skip it when

  • The data has structure you can exploit — use it (convolutions for images, attention for sequences)
  • You need calibrated probabilities — neural nets are over-confident without post-hoc calibration
  • Embedded / microcontroller-class hardware with no SIMD or GPU and tight latency
  • You need step-by-step decision traces — neural nets are opaque inside

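import torch, torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset

class MLP(nn.Module):
    def __init__(self, d_in, d_out, hidden=(256, 128), dropout=0.1):
        super().__init__()
        layers, prev = [], d_in
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.GELU(), nn.Dropout(dropout)]
            prev = h
        layers.append(nn.Linear(prev, d_out))   # final layer returns raw logits
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# Placeholder data so the loop runs (an assumption): 512 random samples, 20 features, 10 classes
X, y = torch.randn(512, 20), torch.randint(0, 10, (512,))
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = MLP(d_in=20, d_out=10)
opt   = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()   # expects logits and integer class labels

model.train()                     # enables dropout during training
for epoch in range(50):
    for xb, yb in loader:
        loss = loss_fn(model(xb), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()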
Want depth-vs-width, universal approximation, and the failure modes?

Universal approximation

$$ \forall \,\varepsilon{>}0,\; \exists\, n,\, \{w_i, b_i, c_i\}: \quad \sup_{\mathbf{x}\in K} \left|\, f(\mathbf{x}) - \sum_{i=1}^{n} c_i\, \sigma(\mathbf{w}_i^\top \mathbf{x} + b_i)\,\right| < \varepsilon $$

  • ∀ ε > 0: "for any error tolerance you pick, however tiny"
  • ∃ n, {wᵢ, bᵢ, cᵢ}: "there exists some number of neurons and a set of weights/biases/output-weights"
  • sup over x ∈ K: worst-case error over a bounded input region K
  • Σ: summation, one term per hidden neuron
  • A single hidden layer with enough units can approximate any continuous f on a compact set. Depth makes "enough" tractable.

$$ \text{for any error tolerance } \varepsilon \;\Rightarrow\; \text{a 1-hidden-layer network with enough neurons gets within } \varepsilon \text{ of } f \text{ everywhere} $$

In words. Pick any continuous function f you want to approximate, and any error budget ε (epsilon — Greek letter for a small positive number). The theorem guarantees you can find a single-hidden-layer network — with enough neurons — whose output never differs from f by more than ε, anywhere in a bounded input region. The catch: "enough neurons" can mean an astronomical number. In practice, going deep (many layers, fewer neurons each) is far more efficient than going wide (one huge layer) — that's why we stack.
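The theorem only says suitable weights exist; it does not say how to find them. For one special case the weights can be built by hand, which makes the statement concrete. The sketch below assumes σ = ReLU, a one-dimensional input on [0, 1], and sin(2πx) as the target; it constructs a sum of the form Σ cᵢ σ(wᵢ x + bᵢ) by piecewise-linear interpolation rather than by training. All function names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_network(f, n):
    """Build sum_i c_i * relu(w_i * x + b_i) that interpolates f at n + 1 knots on [0, 1]."""
    knots = np.linspace(0.0, 1.0, n + 1)
    slopes = np.diff(f(knots)) / np.diff(knots)   # slope of f's interpolant on each interval
    # One unit per knot: weight 1, bias -knot, output weight = change in slope at that knot.
    w = np.ones(n)
    b = -knots[:-1]
    c = np.diff(slopes, prepend=0.0)
    # One extra unit with w = 0, b = 1 always outputs relu(1) = 1 and supplies the offset f(0).
    w = np.append(w, 0.0)
    b = np.append(b, 1.0)
    c = np.append(c, f(0.0))
    return lambda x: relu(np.outer(x, w) + b) @ c

def target(x):
    return np.sin(2 * np.pi * x)

xs = np.linspace(0.0, 1.0, 2001)   # dense grid used to estimate the worst-case error

def sup_error(n):
    return np.max(np.abs(target(xs) - relu_network(target, n)(xs)))

errors = {n: sup_error(n) for n in (5, 10, 50)}
print(errors)
```

The worst-case error shrinks as the hidden layer grows, which is the ε–n trade-off in the formula. The neuron count needed for a given ε grows quickly with input dimension, which is the "astronomical" catch mentioned above.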

Depth vs. width. The universal approximation theorem says one hidden layer suffices in principle. In practice, deep narrow networks often generalize better than shallow wide ones for the same parameter count, because depth lets the model compose features hierarchically. Modern intuition: width gives capacity, depth gives composition.
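To compare shapes "for the same parameter count" you need to count parameters first. The helper below is a small sketch (the `mlp` and `n_params` names and the layer sizes are assumptions for illustration) that builds one shallow wide and one deep narrow network with nearly matching budgets.

```python
import torch.nn as nn

def mlp(sizes):
    """Build an MLP with a ReLU between consecutive Linear layers."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:              # no activation after the output layer
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

def n_params(model):
    """Total number of weights and biases."""
    return sum(p.numel() for p in model.parameters())

shallow_wide = mlp([20, 332, 10])           # one wide hidden layer
deep_narrow = mlp([20, 64, 64, 64, 10])     # three narrow hidden layers

print(n_params(shallow_wide), n_params(deep_narrow))  # 10302 10314
```

Both models have roughly 10k parameters, so any difference in validation performance between them on a given dataset comes from shape rather than size.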

Vanishing / exploding gradients. In deep networks, repeated multiplication of small or large Jacobian factors makes gradients shrink or explode through the chain rule. Modern fixes: ReLU (constant gradient on positive inputs), batch / layer normalization (keep activations on a sensible scale), and residual connections — additive "gradient highways" that let the gradient reach early layers largely undamaged.
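The shrinking gradient can be measured directly. The sketch below deliberately uses sigmoid activations and default PyTorch initialization to make the effect pronounced, then adds skip connections to the same layers; depth, width, and batch size are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, d = 30, 32
layers = nn.ModuleList([nn.Linear(d, d) for _ in range(depth)])

def first_layer_grad_norm(residual):
    """Gradient norm at the first layer's weights for a deep sigmoid stack."""
    layers.zero_grad()
    h = torch.randn(64, d)
    for layer in layers:
        out = torch.sigmoid(layer(h))
        h = h + out if residual else out    # the skip connection adds the input back
    h.sum().backward()
    return layers[0].weight.grad.norm().item()

plain_norm = first_layer_grad_norm(residual=False)
residual_norm = first_layer_grad_norm(residual=True)

print(plain_norm, residual_norm)
```

In the plain stack the first layer's gradient should be many orders of magnitude smaller than in the residual version, because the identity path gives the gradient a route that skips the repeated sigmoid derivatives.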

Optimization landscape. The loss surface is non-convex, with many local minima and saddle points. SGD tends to settle in flat minima, which empirically generalize better than sharp ones. This is part of why batch size and learning rate matter for generalization, not just for training speed.

Dead ReLUs. If a unit's pre-activation goes strongly negative and stays there, its gradient is zero and it never updates again. Mitigations: smaller learning rate, Leaky ReLU / GELU, careful initialization, batch norm.
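Dead units can be detected by checking which outputs are zero for every sample in a batch. The sketch below forces half the units of a small layer to be dead by setting a contrived, strongly negative bias (an assumption for the demonstration), then compares the gradients that ReLU and Leaky ReLU pass back to them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 16)

layer = nn.Linear(16, 8)
with torch.no_grad():
    layer.bias[:4] = -100.0    # contrived: push the first 4 units strongly negative

def dead_fraction(act):
    """Fraction of units whose output is exactly zero for every sample in the batch."""
    out = act(layer(x))
    return (out == 0).all(dim=0).float().mean().item()

def bias_grad(act):
    """Gradient of the summed output with respect to each unit's bias."""
    layer.zero_grad()
    act(layer(x)).sum().backward()
    return layer.bias.grad.clone()

relu_dead = dead_fraction(nn.ReLU())
relu_grad = bias_grad(nn.ReLU())            # zero for the dead units: they cannot recover
leaky_grad = bias_grad(nn.LeakyReLU(0.01))  # small but nonzero: the units can still learn

print(relu_dead, relu_grad[:4].tolist(), leaky_grad[:4].tolist())
```

Tracking a statistic like `dead_fraction` on a validation batch during training is one way to notice the problem early.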

Double descent. Past the interpolation threshold (where the network can perfectly memorize the training set), test error often decreases again as you grow the model further. This runs against the classical bias-variance picture, which predicts that test error keeps rising once the model is large enough to overfit. Amazon's MLU-Explain article on double descent has an interactive walkthrough showing the curve form as model size grows.

Reach for it when

  • You can afford to scale (more data, more parameters)
  • End-to-end differentiability matters — embed in any pipeline
  • You have a budget for hyperparameter search (LR, depth, width, regularization)
  • Pre-trained embeddings exist for your domain

Skip it when

  • Very small data — gradient boosting and probabilistic models win
  • You can't engineer reasonable hyperparameters and don't want to AutoML them
  • Adversarial robustness is a hard requirement
  • Causal inference / counterfactual reasoning is the goal

import torch, torch.nn as nn

# Residual MLP block — mitigates vanishing gradients, enables deeper models
class ResBlock(nn.Module):
    def __init__(self, d, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.fc   = nn.Sequential(
            nn.Linear(d, 4 * d), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(4 * d, d),
        )
    def forward(self, x):
        return x + self.fc(self.norm(x))

# A "modern" deep MLP: pre-norm + residual + GELU expansion (the feed-forward block used in transformers and in MLP-Mixer's channel mixing)
class DeepMLP(nn.Module):
    def __init__(self, d_in, d_out, hidden=384, n_blocks=6):
        super().__init__()
        self.embed  = nn.Linear(d_in, hidden)
        self.blocks = nn.ModuleList([ResBlock(hidden) for _ in range(n_blocks)])
        self.head   = nn.Linear(hidden, d_out)
    def forward(self, x):
        x = self.embed(x)
        for blk in self.blocks: x = blk(x)
        return self.head(x)