The four heads above show the kinds of patterns real transformers actually learn: simple positional shifts, plus richer linguistic relationships like adjective→noun and verb→subject. A real model has dozens of these per layer, all running in parallel, and the next layer can compose their outputs. That composition is most of what makes transformers work.
Why attention beats recurrence
An RNN reads tokens one at a time, squeezing the past into a fixed-size hidden state. Two problems follow. First, information from early tokens is progressively overwritten by everything that comes after, so long-range dependencies fade. Second, you can't parallelise across time steps, because step t needs the hidden state from step t-1. Attention sidesteps both. Every token attends to every other token in a single matrix multiplication, so the path length between any two positions is constant and the whole sequence is processed on the GPU at once. You pay an O(N²) bill in time and memory, where N is the sequence length, but for sequences up to tens of thousands of tokens it's worth it.
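To make the contrast concrete, here is a minimal NumPy sketch. The weight matrices `W_h` and `W_x`, the tanh update, and the toy sizes are illustrative assumptions rather than any particular model: the recurrent loop needs N dependent steps, while the attention scores for all N² pairs come out of one matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 4                          # toy sequence length and width (assumed)
X = rng.normal(size=(N, d))          # one row per token

# Recurrent route: each step depends on the previous hidden state,
# so the loop cannot be parallelised across time steps.
W_h = rng.normal(size=(d, d)) * 0.5  # hypothetical recurrent weights
W_x = rng.normal(size=(d, d)) * 0.5  # hypothetical input weights
h = np.zeros(d)
states = []
for x_t in X:
    h = np.tanh(h @ W_h + x_t @ W_x)
    states.append(h)
rnn_states = np.stack(states)        # shape (N, d), built in N sequential steps

# Attention route: every pairwise score in a single matrix multiplication.
scores = X @ X.T / np.sqrt(d)        # shape (N, N): the O(N^2) bill
```

The `(N, N)` shape of `scores` is where the quadratic cost shows up: doubling the sequence length quadruples the number of entries.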
What attention is doing intuitively
Think of a soft database lookup. Each token emits a query ("what am I looking for?"), and every token also exposes a key ("here's what I am") and a value ("here's what I'd contribute"). The dot product query · key scores how well a query matches each key (in practice the score is divided by √d_k, the square root of the key dimension, to keep the softmax from saturating); softmax turns those scores into weights that sum to 1; the output is a weighted average of the values. A hard lookup would pick the single best match. Attention picks all of them, gently, in proportion to how well they match. The whole thing is differentiable, so the model learns what to look for.
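The lookup described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the random `Q`, `K`, `V` matrices stand in for learned projections, and the division by √d_k is the scaling used in the original Transformer paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single sequence.

    Q, K, V: arrays of shape (N, d). Returns (output, weights).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (N, N): how well each query matches each key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted average of the values

rng = np.random.default_rng(0)
N, d = 5, 8                              # toy sizes (assumed)
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
out, weights = attention(Q, K, V)
```

Row `i` of `weights` is token `i`'s soft lookup: every entry is positive, so no key is ever fully excluded, which is the difference from a hard `argmax` lookup.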
Why multi-head
One attention pattern per layer would be a brutal bottleneck. Instead, split the embedding into h slices, each of width d/h, and run h independent attention computations in parallel. One head can track syntax, another coreference, another raw position, another semantic similarity. Concatenate the outputs and project back. The cost is roughly the same FLOPs as a single full-width head, and the result is far more expressive.
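The split, attend, concatenate, project recipe above might look like this in NumPy. It is a sketch under assumptions: the projection matrices `Wq`, `Wk`, `Wv`, `Wo` are random placeholders for learned weights, and `d_model` is assumed to be divisible by `h`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (N, d_model). Returns (output, per-head attention weights)."""
    N, d_model = X.shape
    d_head = d_model // h                # width of each head's slice

    def split(M):
        # (N, d_model) -> (h, N, d_head): one slice per head
        return M.reshape(N, h, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, N, N)
    weights = softmax(scores, axis=-1)                    # one pattern per head
    heads = weights @ V                                   # (h, N, d_head)
    concat = heads.transpose(1, 0, 2).reshape(N, d_model) # glue heads back together
    return concat @ Wo, weights                           # project back to d_model

rng = np.random.default_rng(0)
N, d_model, h = 5, 16, 4                 # toy sizes (assumed)
X = rng.normal(size=(N, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(4))
out, weights = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
```

The projections are the same size as they would be for one full-width head; only the attention step is carried out separately on each `d_head`-wide slice, which is why the total cost stays roughly constant as `h` grows.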
Why position encoding
Attention on its own is blind to order. Strictly speaking it is permutation-equivariant: shuffle the input tokens and the outputs are shuffled the same way but otherwise unchanged. Without help, every word in "the cat sat on the mat" ends up with exactly the same representation as it does in "the mat sat on the cat". So we inject position information, either by adding a sinusoidal vector to the token embeddings (original paper) or, in modern models, by rotating the queries and keys by a position-dependent angle (RoPE). Without it, the architecture has no idea what order anything is in.
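The order-blindness described above can be checked numerically. The sketch below makes simplifying assumptions: it uses identity Q/K/V projections (so `self_attention` works directly on the embeddings) and the sinusoidal formula from the original paper with an even embedding width. RoPE is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Identity projections (an assumption, to keep the demo short).
    scores = X @ X.T / np.sqrt(X.shape[-1])
    return softmax(scores, axis=-1) @ X

def sinusoidal_positions(N, d):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)
    pos = np.arange(N)[:, None]
    two_i = np.arange(0, d, 2)[None, :]
    angles = pos / (10000 ** (two_i / d))
    pe = np.zeros((N, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
N, d = 6, 8                              # toy sizes (assumed)
X = rng.normal(size=(N, d))              # stand-in token embeddings
perm = np.arange(N)[::-1]                # reverse the token order
pe = sinusoidal_positions(N, d)

# Without positions: shuffling the inputs just shuffles the outputs.
order_blind = np.allclose(self_attention(X[perm]), self_attention(X)[perm])

# With positions added: the same token now gets a different output
# depending on where it sits in the sequence.
order_aware = not np.allclose(self_attention(X[perm] + pe),
                              self_attention(X + pe)[perm])
```

Both `order_blind` and `order_aware` should come out `True`: the bare attention layer cannot tell the two orderings apart, and adding `pe` is enough to break the symmetry.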