Convolutional Neural Networks (CNN)

Key idea

Share one small filter across the whole image — and learn what it should detect. A CNN scans a tiny window (the kernel) over the input, looking for a specific pattern: an edge, a texture, a colour blob. The same handful of weights is reused at every position, so the network has dramatically fewer parameters than a dense net and already "knows" that a pattern in the top-left is the same kind of thing in the bottom-right. The kernel weights aren't designed by hand — they're learned by gradient descent from a loss signal, the same way every other neural-net weight is. Stack many such layers and the network grows from edges, to textures, to parts, to whole objects.

[Figure: a 3×3 filter (9 weights) slides over every position of a 24×24 input image and produces a 22×22 feature map. Indigo = positive response, orange = negative.]

Pooling. Downsample the feature map by taking the max (or mean) of every 2×2 block: fewer numbers, larger receptive field.

[Figure: each 2×2 block of the 22×22 feature map becomes one cell of the 11×11 pooled map. Half the spatial size, same channels.]

Why share weights across the image

A dense network treats every pixel as an independent feature: a 224×224 RGB image becomes a 150,528-dim vector, and a single hidden layer with 1,024 units already burns roughly 154 million parameters before you've learned anything. Worse, the network has to relearn "an edge looks like an edge" separately at every position, because it has no built-in notion that the top-left and bottom-right of an image play by the same rules.

A CNN replaces that with a tiny 3×3 (or 5×5) kernel that is reused at every spatial location. Two consequences fall out for free:

Parameter count collapses. A 3×3 conv with 64 output channels has 9·C_in·64 weights regardless of image size. You can now train a deep network on modest data.
Translation equivariance. Shift the input by one pixel and the feature map shifts by one pixel. The model doesn't need separate evidence for the same pattern at different positions.
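The parameter arithmetic above can be checked in a few lines. This is a minimal sketch in plain Python; the helper names `dense_params` and `conv_params` are made up for illustration, and both counts include one bias per unit or filter, which the prose figures leave out.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Fully connected layer: one weight per (input, unit) pair plus one bias per unit."""
    return n_inputs * n_units + n_units


def conv_params(c_in: int, c_out: int, k: int) -> int:
    """k×k convolution: k*k*c_in weights per output channel plus one bias per channel."""
    return k * k * c_in * c_out + c_out


n_inputs = 224 * 224 * 3                   # flattened RGB image
dense = dense_params(n_inputs, 1024)       # first hidden layer of a dense net
conv = conv_params(c_in=3, c_out=64, k=3)  # first 3×3 conv layer with 64 filters

print(f"flattened input: {n_inputs:,} values")   # 150,528
print(f"dense layer:     {dense:,} parameters")  # 154,141,696
print(f"conv layer:      {conv:,} parameters")   # 1,792
```

The conv count does not depend on the image size at all, which is the point of weight sharing: doubling the resolution leaves it at 1,792.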

Why pool / downsample

A single conv layer only sees a 3×3 patch. To recognise a face you need to integrate information from hundreds of pixels. The classic trick is pooling — replace each 2×2 block of the feature map with its max (or mean), halving the spatial dimensions. After a few rounds, what was a 224×224 image is a 7×7 grid of high-level features, each summarising a chunk of the original. Modern designs often skip pooling and use strided convolutions (jump 2 pixels at a time) for the same effect, with the benefit that the downsampling step is itself learned.
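To make the 2×2 pooling step concrete, here is a small sketch in plain Python on a nested list. The helper `max_pool_2x2` is an illustrative name, and it assumes the feature map has even height and width.

```python
def max_pool_2x2(fmap):
    """2×2 max pooling with stride 2 on a 2D list (even height and width assumed)."""
    h, w = len(fmap), len(fmap[0])
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


fmap = [
    [1, 3, 0, 2],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(fmap)
print(pooled)  # [[4, 2], [2, 8]]

# Halving a 224-pixel side five times gives the 7×7 grid mentioned above.
size = 224
for _ in range(5):
    size //= 2
print(size)  # 7
```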

Why stack convolutions

A single 3×3 kernel only sees one pixel in each direction around its centre. But two stacked 3×3 convs see a 5×5 patch, three see a 7×7, and so on: the receptive field grows with depth. Combine that with downsampling and a deep CNN's top layers see the entire image, but built compositionally: edges from layer 1 → corners and curves from layer 2 → textures and parts from layer 3 → object-shaped responses near the top. This hierarchy is a large part of why CNNs work, and why their feature maps are more interpretable than those of most other architectures.
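A quick sketch of why depth alone is not enough and downsampling matters. It assumes stride-1, undilated 3×3 convs, so each layer adds 2 pixels to the receptive field; `stacked_rf` is an illustrative helper name.

```python
def stacked_rf(n_layers: int, k: int = 3) -> int:
    """Receptive field (in input pixels) of n stacked k×k convs, stride 1, no dilation."""
    return 1 + n_layers * (k - 1)


print([stacked_rf(n) for n in (1, 2, 3)])  # [3, 5, 7]

# How many stride-1 3×3 layers until one output sees a full 224-pixel-wide image?
n = 0
while stacked_rf(n) < 224:
    n += 1
print(n)  # 112
```

Over a hundred layers just to see the whole image is impractical, which is why real networks interleave downsampling with the convolutions.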

→ For the receptive-field formula and dilated convs, switch to the In-depth tier.

Reach for it when

Image classification, detection, segmentation: pretrained ResNet / EfficientNet / ConvNeXt backbones are a download away
Audio spectrograms, medical scans, satellite imagery — anything on a 2D grid with local structure
You have limited data and need the locality / translation-equivariance prior to do real work
On-device or low-latency inference — small CNNs are still hard to beat for FLOPs-per-accuracy

Skip it when

Text or sequences — use a transformer or RNN
Graph-structured data — use a GNN
The important relationships are long-range and don't compose locally
You have tens of millions of labelled images and the compute for a ViT — it will likely edge ahead

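import torch.nn as nn

n_classes = 10  # example value; set this to the number of classes in your dataset

# Two conv + pool stages followed by a linear classifier.
# The 64 * 8 * 8 input size assumes 32×32 RGB images (two 2×2 pools: 32 → 16 → 8).
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, n_classes),
)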

2D convolution

$$ (\mathbf{x} \ast \mathbf{k})[i, j] \;=\; \sum_{m, n} \mathbf{x}[i+m,\, j+n] \cdot \mathbf{k}[m, n] $$

x: input feature map (or image channel)
k: kernel, small (e.g. 3×3 or 5×5), learned
(i, j): output spatial position; the kernel slides over every such position
Output is "how strongly the filter's pattern appears at each position"

$$ \text{output}[i, j] \;=\; \text{sum over the kernel window of (image pixel} \times \text{kernel weight)} $$

In words. At every output position (i, j), line up the small kernel (a 3×3 or 5×5 grid of learned numbers) over a matching patch of the input. Multiply each kernel weight by the pixel it sits on and add all those products together: that's one output number. Slide the kernel one pixel over and repeat. The ∗ symbol means "convolution"; the Σ just means "add up everything inside the kernel window". Strictly speaking, the formula as written (with x[i+m, j+n]) is cross-correlation, which is what deep-learning libraries compute under the name convolution; because the kernel is learned, the missing flip makes no practical difference. The same kernel is reused at every position, which is why CNNs have so few parameters.
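The sum in the formula translates almost word for word into code. Below is a minimal, deliberately slow sketch in plain Python (no padding, stride 1). The function name `conv2d_valid`, the toy image, and the hand-picked edge kernel are all illustrative assumptions; a real CNN would learn the kernel.

```python
def conv2d_valid(x, k):
    """out[i][j] = sum over (m, n) of x[i+m][j+n] * k[m][n]; no padding, stride 1."""
    kh, kw = len(k), len(k[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [
        [sum(x[i + m][j + n] * k[m][n] for m in range(kh) for n in range(kw))
         for j in range(out_w)]
        for i in range(out_h)
    ]


# 5×5 image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Same image with the edge moved one pixel to the right.
shifted = [[0, 0, 0, 1, 1] for _ in range(5)]

# Vertical-edge detector: responds where brightness increases left to right.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

fmap = conv2d_valid(image, kernel)
fmap_shifted = conv2d_valid(shifted, kernel)
print(fmap[0])          # [3, 3, 0]
print(fmap_shifted[0])  # [0, 3, 3]
```

The response moves one column to the right when the edge does, which is translation equivariance in miniature. The output is also two pixels smaller than the input in each dimension, just as a 24×24 input gives a 22×22 feature map.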

Anatomy of a conv layer. Each layer learns F filters, each of shape (C_in, k, k), applied at every spatial position. Input shape (B, C_in, H, W) becomes (B, F, H', W'). Three knobs determine H' and W':

Padding: pixels of zeros added around the input border. With padding=1 and a 3×3 kernel, the spatial dimensions are preserved; without padding, every 3×3 conv shrinks height and width by 2 pixels each.
Stride: how many input pixels the kernel jumps between output positions. Stride 2 halves the spatial dimensions and is a common modern replacement for pooling.
Dilation: inserts gaps between kernel taps, so a 3×3 dilated-by-2 kernel covers a 5×5 patch with the same 9 weights. Used in segmentation networks to grow the receptive field without losing resolution.
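How the three knobs combine into H' and W' can be sketched with one small function. It follows the output-size formula given in the PyTorch `Conv2d` documentation; the helper name `conv_out_size` is an illustrative choice, and it handles one spatial axis at a time.

```python
def conv_out_size(size: int, k: int, padding: int = 0, stride: int = 1, dilation: int = 1) -> int:
    """Output length along one spatial axis of a convolution."""
    effective_k = dilation * (k - 1) + 1   # span of the kernel once gaps are inserted
    return (size + 2 * padding - effective_k) // stride + 1


print(conv_out_size(24, k=3))                         # 22: no padding shrinks by 2
print(conv_out_size(24, k=3, padding=1))              # 24: padding=1 preserves size
print(conv_out_size(224, k=3, padding=1, stride=2))   # 112: stride 2 halves
print(conv_out_size(24, k=3, dilation=2))             # 20: kernel now spans 5 pixels
print(conv_out_size(24, k=3, padding=2, dilation=2))  # 24: padding=2 restores size
```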

Pooling vs strided convolution. Max-pooling (or average-pooling) is parameter-free and provides a hard form of translation invariance over its window. Strided convolutions are learned downsampling and tend to preserve more information; most modern architectures (ResNet, ConvNeXt, EfficientNet) rely on them for most of their downsampling (ResNet still keeps one max-pool in its stem), with a single global average pool at the very end before the classifier head.

ResNet (He et al., 2015). Naively stacking 50+ conv layers used to hurt: past a certain depth, plain networks became harder to optimise, and even their training error went up. ResNet added a skip connection around each block: y = x + F(x). Now the block only has to learn the residual on top of the identity, and gradients have a direct highway back to early layers. This single change made it practical to train networks far deeper than before (152 layers on ImageNet, against 19 for VGG) and is now standard in essentially every modern architecture, vision or otherwise.
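A minimal sketch of a residual block in PyTorch, to show where y = x + F(x) lives in code. The class name `ResidualBlock` and the exact layer choices (two 3×3 convs with batch norm, and a ReLU after the addition, in the spirit of the basic block in the ResNet paper) are assumptions for illustration, not a reproduction of any particular library implementation.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """y = relu(x + F(x)), where F is conv → BN → ReLU → conv → BN."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        # Skip connection: the input is added back onto the learned residual F(x).
        return self.act(x + self.body(x))


block = ResidualBlock(16)
x = torch.randn(2, 16, 8, 8)
y = block(x)
print(y.shape)  # torch.Size([2, 16, 8, 8])
```

Because padding=1 keeps the spatial size and the channel count is unchanged, x and F(x) have the same shape and can be added directly. Blocks that change resolution or width need a projection on the skip path, which this sketch leaves out.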

Normalization. Batch normalization is the classical default — normalize activations across the batch, learn a per-channel scale and shift. It speeds up training and acts as mild regularization, but it breaks with small batches (detection, segmentation, video). Group norm and layer norm are batch-independent and have largely taken over in modern recipes. The norm-vs-activation order (pre-norm vs post-norm) matters for training stability — pre-norm is the safer default.
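The batch dependence described above can be observed directly. This sketch assumes PyTorch and arbitrary tensor sizes; it normalises the same sample once on its own and once as part of a batch of 8, then compares the results.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 16, 5, 5)      # batch of 8 feature maps with 16 channels

bn = nn.BatchNorm2d(16).train()   # training mode: statistics come from the batch
gn = nn.GroupNorm(num_groups=4, num_channels=16)

# First sample normalised inside the full batch vs on its own.
bn_same = torch.allclose(bn(x)[:1], bn(x[:1]), atol=1e-5)
gn_same = torch.allclose(gn(x)[:1], gn(x[:1]), atol=1e-5)

print("BatchNorm output independent of batch:", bn_same)
print("GroupNorm output independent of batch:", gn_same)
```

Batch norm's output for a sample changes with whatever else is in the batch, while group norm computes its statistics per sample. That is why group norm behaves the same at batch size 1 as at batch size 256.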

Reach for it when

Image classification — start from a pretrained ResNet / EfficientNet / ConvNeXt and fine-tune the head
Object detection — Faster R-CNN, YOLO, RetinaNet all use CNN backbones
Semantic / instance segmentation — U-Net for medical, Mask R-CNN for natural images
Audio, video, medical imaging where labelled data is scarce but a pretrained backbone exists

Skip it when

ImageNet-scale datasets and you have the compute budget — Vision Transformers may pull ahead
The task needs reasoning across the whole image at every layer (use attention)
Inputs aren't really grid-structured (sets, graphs, irregularly-sampled point clouds)
You need symbolic / discrete-token reasoning

import torch.nn as nn
import torchvision.models as tvm

n_classes = 10  # example value; set this to the number of classes in your dataset

# Transfer learning: load a pretrained ResNet, replace the final layer
model = tvm.resnet50(weights=tvm.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, n_classes)

# Freeze the backbone for the first few epochs so only the new head (model.fc) trains.
# To fine-tune later, set requires_grad = True on the backbone parameters again.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True

Receptive field

$$ \text{RF}^{(\ell)} \;=\; \text{RF}^{(\ell-1)} \;+\; \big(k^{(\ell)} - 1\big)\,\prod_{j<\ell} s^{(j)} $$

k^(ℓ): kernel size at layer ℓ
s^(j): stride at layer j
∏: product over all earlier layers' strides
The patch of input that influences one output activation grows with depth

$$ \text{receptive field at layer } \ell \;=\; \text{receptive field at } \ell{-}1 \;+\; (\text{kernel size} - 1) \times \text{product of all earlier strides} $$

In words. The receptive field is the size of the input patch that influences one pixel of a deeper feature map. Each new layer's kernel sees a window of its input, but the input has already been downsampled by previous strides, so its window covers a larger region of the original image. The ∏ (capital pi) means "multiply together": here, multiply all the strides from earlier layers. For example, after a 2×2 stride-2 pooling layer each pixel covers a 2-pixel-wide piece of the original and neighbouring pixels sit 2 input pixels apart; a 3×3 kernel on top of that sees a 2 + (3 − 1)·2 = 6-pixel-wide patch of the input.
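The recurrence can be evaluated layer by layer. A minimal sketch in plain Python, assuming no dilation and taking the receptive field of a raw input pixel to be 1 as the starting point (the formula above leaves that base case implicit); `receptive_field` is an illustrative helper name.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride), input side first.

    Returns the receptive field after each layer, in input pixels.
    """
    rf = 1     # a raw input pixel sees only itself
    jump = 1   # product of the strides of all earlier layers
    sizes = []
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
        sizes.append(rf)
    return sizes


print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]
print(receptive_field([(2, 2), (3, 1)]))          # [2, 6]
print(receptive_field([(3, 2), (3, 2), (3, 1)]))  # [3, 7, 15]
```

The second call is a 2×2 stride-2 pool followed by a 3×3 conv, which gives the 6-pixel patch. The third shows how quickly strides compound: three layers already cover 15 pixels.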

Effective receptive field. The theoretical formula above is an upper bound. In practice the effective receptive field is roughly Gaussian-shaped and much smaller — most of the gradient mass concentrates near the centre. Dilated (atrous) convolutions enlarge the receptive field without growing parameters or losing resolution, which is why they show up in semantic segmentation (DeepLab) and dense prediction generally.

Depthwise-separable convolutions. A standard 3×3 conv with C_in → C_out channels costs 9 · C_in · C_out parameters. Decompose it into a 3×3 depthwise conv (one filter per input channel, no channel mixing) followed by a 1×1 pointwise conv (mixes channels at every spatial position), and the cost drops to 9 · C_in + C_in · C_out. The saving is a factor of 9 · C_out / (9 + C_out): roughly 8× at C_out = 64, approaching 9× for wider layers. Depthwise convolutions are the trick behind MobileNet, EfficientNet, and ConvNeXt; the small accuracy loss is usually more than paid back by the FLOPs saved.
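In PyTorch the decomposition is expressed with the `groups` argument of `nn.Conv2d`. The sketch below uses arbitrary example channel counts (64 → 128) and disables biases so the counts line up with the formulas above; `n_params` is an illustrative helper.

```python
import torch
import torch.nn as nn

c_in, c_out = 64, 128

standard = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False)
separable = nn.Sequential(
    # Depthwise: groups=c_in gives one 3×3 filter per input channel, no channel mixing.
    nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in, bias=False),
    # Pointwise: a 1×1 conv mixes channels at every spatial position.
    nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),
)


def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


print(n_params(standard))   # 9 * 64 * 128 = 73728
print(n_params(separable))  # 9 * 64 + 64 * 128 = 8768

x = torch.randn(1, c_in, 32, 32)
print(standard(x).shape == separable(x).shape)  # True: same output shape
```

With these sizes the separable version has roughly 8.4× fewer parameters while producing an output of the same shape.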

Vision Transformers (ViT) vs CNNs. A ViT slices the image into 16×16 patches, embeds each patch, and runs them through a stack of self-attention layers. It throws away the CNN's locality and translation-equivariance priors, and so on small data it underperforms CNNs badly. On hundreds of millions of images (labelled sets such as JFT-300M, or image-text collections such as LAION) it overtakes them: with enough data, learning the priors is better than hard-coding them. The modern picture is more nuanced. ConvNeXt (Liu et al., 2022) showed that a CNN modernised with depthwise convs, LayerNorm, GELU, and a ViT-style training recipe matches ViT performance on ImageNet, suggesting the gap was mostly about training tricks. Hybrids like CoAtNet and MaxViT combine convs in early stages with attention in later stages and have posted some of the strongest ImageNet results.

Attention-augmented convolutions. Even before full ViTs, people grafted self-attention onto CNN backbones to inject global context — squeeze-and-excitation (channel-wise gating), non-local blocks, axial attention. They're still useful when you want CNN efficiency with the ability to do some long-range reasoning.

Inductive bias as a feature. CNNs hard-code three priors: locality (kernels are small), translation equivariance (the same kernel everywhere), and hierarchical composition (depth + downsampling). These priors fit natural images well, which is why CNNs learn sample-efficiently from small datasets. They are also constraints you'd want to relax once data is no longer the bottleneck. The choice between CNN and ViT is mostly a question of how much data you can throw at the problem.

Equivariance beyond translation. For rotation, reflection, or scale equivariance, see group-equivariant CNNs (Cohen & Welling, 2016) and steerable CNNs — important in molecular property prediction, astronomy, and medical imaging where the orientation of an object carries no information.

Reach for it when

On-device / low-latency inference — depthwise-separable backbones still rule the FLOPs-per-accuracy frontier
Limited data — CNN priors do real work, ViTs need orders of magnitude more
Dense prediction (segmentation, depth, optical flow) — receptive-field engineering matters and is well-understood
Domains with clear locality and translation invariance — natural images, medical scans, spectrograms

Skip it when

You need genuinely long-range reasoning the receptive field can't reach in a few layers
You want a single architecture across modalities — transformers generalise more naturally
You have ImageNet-scale data and a TPU pod budget — a well-trained ViT or hybrid usually wins
The task isn't equivariant in any obvious sense (text, tabular, symbolic)

import torch
import torch.nn as nn

# ConvNeXt-style block: depthwise + LayerNorm + inverted bottleneck + residual
class ConvNeXtBlock(nn.Module):
    def __init__(self, dim, expand=4):
        super().__init__()
        # 7×7 depthwise conv: groups=dim gives one filter per channel, no channel mixing
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm   = nn.LayerNorm(dim, eps=1e-6)
        # Pointwise (1×1) convs, written as Linear layers acting on channels-last tensors
        self.pw1    = nn.Linear(dim, expand * dim)
        self.act    = nn.GELU()
        self.pw2    = nn.Linear(expand * dim, dim)

    def forward(self, x):
        residual = x
        x = self.dwconv(x)                          # (B, C, H, W)
        x = x.permute(0, 2, 3, 1)                   # → (B, H, W, C) for LayerNorm
        x = self.norm(x)
        x = self.pw2(self.act(self.pw1(x)))         # expand, GELU, project back to dim
        x = x.permute(0, 3, 1, 2)                   # back to (B, C, H, W)
        return residual + x

What the figure shows

Top: a single 3×3 kernel (9 numbers) slides over the whole image; at every position it computes a weighted sum that becomes one cell of the feature map. The "Use a preset filter" mode runs hand-designed kernels (Sobel edges, blur, sharpen); click any cell to cycle its value. The "Learn the filter" mode starts from 9 random numbers and runs gradient descent toward a chosen target output: the weights aren't designed, they're learned from a loss signal, exactly like every other weight in a network.

Bottom: pooling downsamples the feature map, with each 2×2 block becoming one output cell (max or mean). Same channels, half the spatial size. Stack conv + pool a few times and what was a 224×224 image becomes a 7×7 grid of high-level features. Hover either side of the pool to see which input block becomes which output cell.

The same kernel applies at every position — that's weight sharing, the reason CNNs need so few parameters compared to a dense net.

Where to learn more

CNN Explainer (poloclub): A real CNN running in your browser. Click any feature-map cell to trace exactly which input pixels produced it. The single best interactive resource for seeing what each layer does.
Harley, 3D Visualization of an MNIST CNN: Rotate, zoom, and inspect the volume of activations at every layer of a small CNN classifying your own hand-drawn digits. Older but still uniquely good for the spatial structure.
Olah et al., Feature Visualization (Distill): The seminal article on what individual CNN neurons "see": synthesised images that maximally activate each filter. Beautiful interactive diagrams; pairs well with the follow-up Building Blocks of Interpretability.
Zhang et al., Dive into Deep Learning, CNN chapters: Code-first walkthrough of convolutions, padding/stride, pooling, LeNet, AlexNet, VGG, GoogLeNet, ResNet, DenseNet, with every classical architecture rebuilt from scratch in runnable notebooks.
CS231n (Stanford): The gold-standard course on CNNs for vision. The lecture notes alone are worth working through even without the videos.
He et al. (2015), Deep Residual Learning: The ResNet paper. Skip connections are one of the most consequential ideas in modern deep learning. Short, readable, and the experiments speak for themselves.
Araujo et al., Computing Receptive Fields (Distill): Interactive walkthrough of how receptive fields actually grow with depth, dilation, and stride, including the often-surprising effective receptive field. Essential for designing dense-prediction networks.
Liu et al. (2022), A ConvNet for the 2020s (ConvNeXt): A CNN rebuilt with modern conventions (LayerNorm, depthwise convs, GELU, larger kernels). Matches ViT performance, a useful proof that the CNN-vs-ViT gap was mostly about training recipe, not architecture.
Dosovitskiy et al. (2020), An Image Is Worth 16×16 Words (ViT): The Vision Transformer paper. Read this to understand what CNNs are being compared against, and why scale changes the answer.