Why we need activation functions
Try toggling activation to none in the figure. Every prediction collapses, and that's the point. Without a nonlinearity between layers, two matrix multiplications compose into a single matrix multiplication: W₂ · (W₁ · X) = (W₂ · W₁) · X. Adding biases doesn't change this: the composition is still a single affine map. Stack 100 layers of pure matmuls and you still have no more expressive power than one. The activation (ReLU, sigmoid, GELU, …) breaks that collapse: each layer can bend the representation in a new way, and only then does "deep" mean anything.
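The collapse described above can be checked numerically. This is a minimal NumPy sketch, not the code behind the figure: the shapes, the seed, and the names `W1`, `W2`, `X`, and `relu` are assumptions chosen for illustration. It only shows that this particular merged matrix stops matching once ReLU is inserted.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))    # 4 input features, 5 examples (one per column)
W1 = rng.normal(size=(3, 4))   # first layer: 4 -> 3
W2 = rng.normal(size=(2, 3))   # second layer: 3 -> 2

# No activation: two layers equal one layer with weights W2 @ W1.
two_linear_layers = W2 @ (W1 @ X)
one_merged_layer = (W2 @ W1) @ X
linear_collapses = np.allclose(two_linear_layers, one_merged_layer)


def relu(z):
    """Elementwise max(z, 0)."""
    return np.maximum(z, 0.0)


# With ReLU between the layers, the merged matrix no longer gives the same output.
with_relu = W2 @ relu(W1 @ X)
relu_collapses = np.allclose(with_relu, one_merged_layer)

print(linear_collapses, relu_collapses)
```

The first comparison holds for any choice of matrices, because matrix multiplication is associative. The second fails whenever ReLU zeroes out at least one negative pre-activation that the second layer would otherwise have used.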
How to choose an architecture
Two main knobs: width (neurons per layer) and depth (how many layers).
- Wider layers carry more parallel features at once — useful when many independent things matter.
- Deeper stacks build features compositionally — later layers reuse what earlier ones discovered.
For tabular problems, a few hidden layers of 64–256 units are usually plenty. For images, sequences, or graphs, the architecture itself should encode the data's structure: reach for a CNN, RNN, transformer, or GNN. In practice the hardest part is rarely depth or width; it is getting the input scale, regularisation, and learning rate right.
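To make the two knobs concrete, here is a rough NumPy sketch of a fully connected network whose depth and width come from a single list. The helper names `init_mlp`, `forward`, and `standardise` are hypothetical, the data is made up, and the He-style weight scaling is an assumption rather than something this page prescribes.

```python
import numpy as np


def init_mlp(n_inputs, hidden_sizes, n_outputs, seed=0):
    """Return a list of (W, b) pairs.

    len(hidden_sizes) sets the depth; each entry sets one layer's width.
    """
    rng = np.random.default_rng(seed)
    sizes = [n_inputs, *hidden_sizes, n_outputs]
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        # Assumption: He-style scaling, a common choice for ReLU layers.
        W = rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
        params.append((W, np.zeros(fan_out)))
    return params


def forward(params, X):
    """ReLU on every hidden layer, linear output layer."""
    h = X
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    return h @ W + b


def standardise(X):
    """Rescale each column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)


rng = np.random.default_rng(1)
X_raw = rng.normal(loc=50.0, scale=20.0, size=(32, 10))  # made-up tabular batch
X = standardise(X_raw)

# Three hidden layers with widths in the 64-256 range mentioned above.
params = init_mlp(n_inputs=10, hidden_sizes=[128, 128, 64], n_outputs=1)
out = forward(params, X)
print(out.shape)  # (32, 1)
```

Changing `hidden_sizes` to `[256]` gives a wider, shallower network; `[64, 64, 64, 64]` gives a deeper, narrower one. In a real pipeline the mean and standard deviation used by `standardise` would be computed on the training set and reused for new data.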
Why going deep works
Each layer can learn an abstraction built from the previous layer's features. In an idealised image classifier: layer 1 picks up edges, layer 2 corners, layer 3 textures, layer 4 object parts, layer 5 whole objects. In language: characters → words → phrases → meaning. The universal approximation theorem (switch to the In-depth tier for the formal version) says one sufficiently wide hidden layer can in principle approximate any continuous function on a closed, bounded input region to any desired accuracy. In practice, though, deep narrow networks often generalise better than shallow wide ones with the same parameter count, because hierarchical composition lets the network reuse intermediate features instead of memorising every input combination separately.
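"The same parameter count" is easy to make concrete. The sketch below counts weights and biases for two fully connected layouts with roughly equal budgets. The helper `count_params` and both layer lists are illustrative assumptions, and the example says nothing about which network generalises better; it only shows what an equal-budget comparison looks like.

```python
def count_params(sizes):
    """Weights plus biases for a fully connected net with these layer sizes."""
    return sum(
        fan_in * fan_out + fan_out
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
    )


shallow_wide = [32, 309, 1]        # one hidden layer of 309 units
deep_narrow = [32, 64, 64, 64, 1]  # three hidden layers of 64 units

print(count_params(shallow_wide))  # 10507
print(count_params(deep_narrow))   # 10497
```

Both networks spend about 10,500 parameters. The shallow one puts them all into a single layer of independent features, while the deep one spends most of them on the 64 × 64 connections that let later layers recombine earlier features.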
Training: how the weights got there
The weights in the figure didn't appear by magic — they were learned. You define a loss (how wrong the predictions are), compute the gradient of the loss with respect to every weight using the chain rule (this procedure is called backpropagation), and nudge each weight a tiny step in the direction that lowers the loss. Repeat for millions of steps. The figure freezes one moment in that process; training is what carved the colours you see.
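The loop described above (loss, chain-rule gradients, small step downhill) fits in a few lines for a tiny network. This is a hedged sketch with hand-written backpropagation in NumPy. The toy data, the tanh hidden layer, the layer sizes, and the learning rate `lr` are all assumptions for illustration, not the setup used to train the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(64, 2))  # toy inputs
y = X[:, :1] * X[:, 1:2]                  # toy target: product of the two inputs

# One hidden layer of 8 tanh units, one linear output.
W1 = rng.normal(scale=0.5, size=(2, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros(1)
lr = 0.1  # step size (assumed value)

losses = []
for step in range(500):
    # Forward pass: compute predictions and the mean squared error.
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y
    losses.append(float(np.mean(err ** 2)))

    # Backward pass: apply the chain rule from the loss back to each weight.
    d_pred = 2.0 * err / X.shape[0]  # dLoss/dpred
    dW2 = h.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_h = d_pred @ W2.T
    d_z1 = d_h * (1.0 - h ** 2)      # derivative of tanh is 1 - tanh^2
    dW1 = X.T @ d_z1
    db1 = d_z1.sum(axis=0)

    # Gradient descent: move each weight a small step against its gradient.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(losses[0], losses[-1])
```

Real frameworks compute the backward pass automatically, but the arithmetic they perform is the same chain rule written out here by hand.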
→ Full mechanics in Gradient Descent and Optimisation Algorithms.