Why share weights across the image
A dense network treats every pixel as an independent feature: a 224×224 RGB image becomes a 150,528-dimensional vector, and a single hidden layer with 1,024 neurons already costs about 154 million weights before the network has learned anything. Worse, the network has to relearn "an edge looks like an edge" separately at every position, because it has no built-in notion that the top-left and bottom-right of an image play by the same rules.
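The arithmetic behind that count can be checked directly. This is a minimal sketch, assuming a fully connected layer with one weight per input-output pair and one bias per neuron; the helper name `dense_params` is made up for illustration.

```python
def dense_params(n_inputs: int, n_outputs: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_outputs + n_outputs


n_inputs = 224 * 224 * 3              # flattened 224x224 RGB image
print(n_inputs)                       # 150528
print(dense_params(n_inputs, 1024))   # 154141696
```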
A CNN replaces that with a tiny 3×3 (or 5×5) kernel that is reused at every spatial location. Two consequences fall out for free:
- Parameter count collapses. A 3×3 conv with 64 output channels has 9·Cin·64 weights (plus 64 biases), where Cin is the number of input channels, regardless of image size. That makes it practical to train a deep network on modest data.
- Translation equivariance. Shift the input by one pixel and the feature map shifts by one pixel (at stride 1, and away from the image border). The model doesn't need separate evidence for the same pattern at different positions.
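Both consequences can be checked on a toy example. The sketch below is illustrative only: it assumes a single-channel image, stride 1, and wrap-around borders so that the shift property holds exactly (with zero padding it holds only away from the border). The helper names `conv_params`, `conv2d_wrap` and `shift_right` are made up for this example.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights plus biases of a k x k convolution; image size never appears."""
    return k * k * c_in * c_out + c_out


def conv2d_wrap(image, kernel):
    """Single-channel cross-correlation, stride 1, wrap-around borders."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(
                kernel[i][j] * image[(y + i - kh // 2) % h][(x + j - kw // 2) % w]
                for i in range(kh)
                for j in range(kw)
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


def shift_right(grid, n=1):
    """Circularly shift every row n pixels to the right."""
    return [row[-n:] + row[:-n] for row in grid]


print(conv_params(3, 3, 64))  # 1792 for an RGB input, whatever the image size

image = [[0, 0, 0, 0, 0],
         [0, 1, 1, 0, 0],
         [0, 1, 1, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]
edge = [[-1, 0, 1],
        [-1, 0, 1],
        [-1, 0, 1]]  # simple vertical-edge detector

# Convolving the shifted image gives the same result as shifting the feature map.
equivariant = conv2d_wrap(shift_right(image), edge) == shift_right(conv2d_wrap(image, edge))
print(equivariant)  # True
```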
Why pool / downsample
A single conv layer only sees a 3×3 patch. To recognise a face you need to integrate information from hundreds of pixels. The classic trick is pooling: replace each 2×2 block of the feature map with its max (or mean), halving each spatial dimension. After five rounds, what was a 224×224 image is a 7×7 grid of high-level features, each summarising a chunk of the original. Modern designs often skip pooling and use strided convolutions (which step 2 pixels at a time) for the same effect, with the benefit that the downsampling step is itself learned.
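To make the pooling step concrete, here is a minimal pure-Python sketch of 2×2 max pooling on a single-channel feature map, followed by the side lengths produced by repeated halving. The function name `max_pool_2x2` and the toy values are assumptions for illustration; real frameworks provide their own pooling layers.

```python
def max_pool_2x2(grid):
    """Replace each non-overlapping 2x2 block with its maximum."""
    h, w = len(grid), len(grid[0])
    return [
        [
            max(grid[y][x], grid[y][x + 1], grid[y + 1][x], grid[y + 1][x + 1])
            for x in range(0, w - 1, 2)
        ]
        for y in range(0, h - 1, 2)
    ]


features = [[1, 3, 2, 0],
            [4, 2, 1, 1],
            [0, 1, 5, 6],
            [2, 2, 7, 8]]
print(max_pool_2x2(features))  # [[4, 2], [2, 8]]

# Side length after each halving, starting from a 224-pixel side.
sizes = [224]
while sizes[-1] > 7:
    sizes.append(sizes[-1] // 2)
print(sizes)  # [224, 112, 56, 28, 14, 7]
```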
Why stack convolutions
A single 3×3 kernel can only see one pixel in any direction from its centre. But two stacked 3×3 convs see a 5×5 patch, three see a 7×7, and so on: the receptive field grows with depth. Combine that with downsampling and a deep CNN's top layers see the entire image, but built compositionally: edges from layer 1 → corners and curves from layer 2 → textures and parts from layer 3 → object-shaped responses near the top. This hierarchy is a large part of why CNNs work, and why their feature maps are often easier to interpret than those of many other architectures.
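The growth described above can be tallied with a few lines of code. This is a sketch under simple assumptions: square kernels, no dilation, and each layer described only by a (kernel size, stride) pair. The function name `receptive_field` and the example stack are hypothetical.

```python
def receptive_field(layers):
    """Receptive-field side length after a stack of (kernel_size, stride) layers."""
    rf, jump = 1, 1  # jump: spacing, in input pixels, between neighbouring outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# One, two and three stacked 3x3 convs at stride 1.
print([receptive_field([(3, 1)] * n) for n in (1, 2, 3)])  # [3, 5, 7]

# Hypothetical stack: a 3x3 conv followed by a 2x2 stride-2 pool, repeated three times.
stack = [(3, 1), (2, 2)] * 3
print(receptive_field(stack))  # 22
```

Each downsampling step doubles the spacing between neighbouring outputs, so every later 3×3 conv widens the receptive field by more input pixels than the one before it.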
→ For the receptive-field formula and dilated convs, switch to the In-depth tier.