Why a hidden state
A plain feedforward net sees one fixed-size input and emits one output — it has no notion of "before" or "after". For sequences (sentences, audio frames, sensor readings) the meaning of token t depends on everything that came before it. An RNN solves this by carrying a hidden state — a vector that gets updated at every step. The same weights are reused at every timestep, so the network is effectively a tiny program running in a loop, with the hidden state acting as its working memory.
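To make the "tiny program running in a loop" picture concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass. The names (`rnn_step`, `rnn_forward`, `W_xh`, `W_hh`, `b_h`), the tanh nonlinearity, and the sizes are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_xh, W_hh, b_h):
    """One update: mix the previous hidden state with the new input."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Apply the same step, with the same weights, at every timestep."""
    h = np.zeros(W_hh.shape[0])  # working memory starts empty
    states = []
    for x_t in xs:
        h = rnn_step(h, x_t, W_xh, W_hh, b_h)
        states.append(h)
    return np.stack(states)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 3, 5, 7
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
xs = rng.normal(size=(seq_len, input_size))

states = rnn_forward(xs, W_xh, W_hh, b_h)
print(states.shape)  # (7, 5): one hidden state per timestep
```

Note that the loop body never changes and no per-timestep parameters exist: sequence length affects only how many times the step runs.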
Why RNNs have mostly been replaced
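Two problems pushed them out of mainstream NLP. First, vanishing gradients: backpropagating through hundreds of timesteps multiplies the gradient at every step by the recurrent Jacobian (the same weight matrix, scaled by the activation's derivative), so it tends to shrink exponentially toward zero, or blow up if the weights are large, and the network struggles to learn long-range dependencies. Second, no parallelism across timesteps: each step has to wait for the previous one's hidden state, so training can't keep a GPU busy along the sequence dimension. Transformers sidestep both by replacing recurrence with attention, which connects every position to every other directly and processes all positions in parallel during training.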
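The vanishing-gradient effect can be sketched numerically. The snippet below is a toy illustration under stated assumptions: a tanh RNN, random recurrent weights rescaled so their largest singular value is 0.9, and random inputs. It tracks the norm of the Jacobian of the current hidden state with respect to the initial one, which is the factor a gradient is multiplied by when it flows back that many steps.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, seq_len = 16, 100

# Hypothetical recurrent weights, rescaled to spectral norm 0.9.
W_hh = rng.normal(size=(hidden_size, hidden_size))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)

h = np.zeros(hidden_size)
grad = np.eye(hidden_size)  # d h_t / d h_0, starts as the identity
norms = []
for t in range(seq_len):
    h = np.tanh(W_hh @ h + rng.normal(size=hidden_size))
    # Jacobian of one step: diag(1 - h_t^2) @ W_hh
    jac = (1.0 - h**2)[:, None] * W_hh
    grad = jac @ grad
    norms.append(np.linalg.norm(grad, 2))

# Each step shrinks the norm by at least a factor of 0.9,
# so after 100 steps it is at most 0.9**100 (about 2.7e-5).
print(f"after 1 step: {norms[0]:.3e}, after {seq_len} steps: {norms[-1]:.3e}")
```

With a spectral norm above 1 the same product can grow instead of shrink, which is the exploding-gradient side of the problem; tanh saturation makes that case less predictable than the bound shown here.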
Where they still make sense
- Streaming and real-time inference. You process one step as it arrives — perfect for a microphone, a network packet stream, or a sensor.
- Tiny models on tiny hardware. A 100k-parameter LSTM can fit in the memory of a typical microcontroller; a transformer of comparable quality usually needs more parameters and more working memory.
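- Bounded memory. The hidden state has a fixed size regardless of sequence length. Standard self-attention costs O(N²) time in sequence length N, and its cache of past keys and values keeps growing during inference; an RNN costs O(N) time and keeps only its constant-size state.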
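- Online learning. You can keep training on a never-ending stream without re-batching, typically by carrying the hidden state forward and backpropagating through only the most recent steps (truncated backpropagation through time).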
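The size and scaling claims in the list above can be checked with back-of-envelope arithmetic. The helpers below are hypothetical and deliberately rough: the attention count covers only the score matrix and the weighted sum of values for one head (no projections, no softmax), the RNN count covers only the matrix-vector products of a single-layer vanilla RNN, and the LSTM parameter count assumes one bias vector per gate.

```python
def rnn_cost(seq_len, input_size, hidden_size):
    """Rough multiply-add count and state size for a single-layer vanilla RNN."""
    per_step = hidden_size * (hidden_size + input_size)
    return {"mults": seq_len * per_step, "state_floats": hidden_size}

def attention_cost(seq_len, d_model):
    """Rough count for one naive self-attention head over the full sequence."""
    return {
        "mults": 2 * seq_len * seq_len * d_model,  # Q K^T, then scores @ V
        "score_floats": seq_len * seq_len,         # full N x N score matrix
    }

def lstm_param_count(input_size, hidden_size):
    """Four weight blocks (three gates plus candidate), one bias each."""
    return 4 * (hidden_size * (hidden_size + input_size) + hidden_size)

for n in (1_000, 2_000, 4_000):
    print(n, rnn_cost(n, 64, 64), attention_cost(n, 64))

print(lstm_param_count(input_size=32, hidden_size=128))  # 82432
```

Doubling the sequence length doubles the RNN's work and leaves its state untouched, while the attention counts quadruple. The last line shows that an LSTM with 128 hidden units and 32 inputs lands at roughly 82k parameters, in the same range as the 100k figure mentioned above.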
LSTM and GRU, intuitively
A vanilla RNN rewrites its whole memory at every step, which is why it forgets. LSTM (Hochreiter & Schmidhuber, 1997) adds small "gates", each a sigmoid layer that outputs values between 0 and 1, acting like dimmer switches on the memory. The now-standard version has three: forget, input, and output (the forget gate was a later addition by Gers et al., 2000). The cell decides how much of the old memory to keep, how much new content to write in, and how much of the result to expose. Crucially, the memory has its own additive update path, so gradients can survive long sequences. GRU (Cho et al., 2014) is the same idea trimmed down to two gates and one state: often about as accurate, with fewer parameters and usually somewhat faster training.
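The gate mechanics can be sketched as a single LSTM step in NumPy. This is a minimal illustration of the standard formulation, not any library's implementation: the function name `lstm_step`, the stacked weight layout (forget, input, output, candidate), and the sizes are assumptions chosen for readability.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM update. W has shape (4*H, H+D); b has shape (4*H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])          # forget gate: how much old memory to keep
    i = sigmoid(z[H:2 * H])      # input gate: how much new content to write
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much memory to expose
    g = np.tanh(z[3 * H:4 * H])  # candidate content
    c = f * c_prev + i * g       # additive memory update
    h = o * np.tanh(c)           # exposed hidden state
    return h, c

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(6, D)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

The line `c = f * c_prev + i * g` is the additive path: when the forget gate is near 1 and the input gate near 0, the memory passes through a step almost unchanged, and so does the gradient flowing back through it. A GRU merges the cell and hidden state and uses two gates in place of three, but follows the same gated-interpolation pattern.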