Learn by doing — no labels, only rewards. The framework that powers game-playing agents, robotics, and RLHF.
Key idea
An agent acts in an environment; the environment gives reward. No labels. No supervised target. The agent has to figure out, from sparse and often delayed reward signals, what good behaviour looks like. The classical loop: observe state, choose action, receive reward and new state, repeat.
Watch tabular Q-learning learn a grid-world policy — Q-values flood backward from the goal as episodes accumulate
A 5×5 grid world. The agent (indigo dot) starts top-left and gets +1 for reaching the goal (orange square) and -1 for stepping on a wall (grey). Each cell shows the max Q-value across actions — colour intensity = how good that state is. Watch the "warmth" flood backward from the goal as more episodes play out. Drop ε to greedy (almost no exploration); the agent gets stuck.
The RL loop. State s → action a → reward r → next state s'. Repeat. The agent learns a policy π(a | s) that maximises expected discounted future reward.
Q-learning. Learn a value Q(s, a) = "expected discounted reward if I take action a in state s and act greedily afterwards." Update rule: Q(s, a) ← Q(s, a) + α · (r + γ max_{a'} Q(s', a') − Q(s, a)).
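Exploration vs exploitation. ε-greedy: with probability ε pick a random action, otherwise pick the action with the highest Q. Too much ε → the agent keeps acting largely at random and never cashes in on what it has learned. Too little → it locks onto the first passable behaviour it finds and gets stuck in local optima.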
Policy gradient. Parametrize the policy directly (πθ(a | s)) and follow the gradient of expected reward. REINFORCE, A2C, PPO. Works when actions are continuous or the action space is huge.
Modern deep RL. Replace the Q-table with a neural network → DQN. Add policy networks → A2C, PPO. Add a Q-value critic, a replay buffer, and an entropy bonus → SAC. The pieces are old; the engineering is what makes them work.
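The ε-greedy choice and the Q-learning update rule described above can be sketched as two small functions. This is a minimal illustration, not a reference implementation: the 3-state, 2-action table is made up, and the `terminal` flag (which drops the bootstrap term at the end of an episode) is an assumption added here for completeness.

```python
import numpy as np

def epsilon_greedy(Q, s, eps, rng):
    """Pick a random action with probability eps, else the greedy one."""
    n_actions = Q.shape[1]
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(Q[s].argmax())

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95, terminal=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Assumption: no bootstrapping from s_next when the episode has ended.
    bootstrap = 0.0 if terminal else gamma * Q[s_next].max()
    td_error = r + bootstrap - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical 3-state, 2-action table used only for illustration.
rng = np.random.default_rng(0)
Q = np.zeros((3, 2))
Q[2] = [0.0, 1.0]                               # pretend state 2 is already known to be good
a = epsilon_greedy(Q, s=2, eps=0.0, rng=rng)    # eps = 0 means purely greedy
td = q_update(Q, s=1, a=0, r=0.0, s_next=2)     # value flows back from state 2 to state 1
```

Even with zero immediate reward, Q[1, 0] moves up because the next state already has value. That single step is the "flooding backward" effect in miniature.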
Reach for it when
Sequential decisions with delayed reward
Game-playing, robotics, control
RLHF — aligning models to human preferences
No labelled data, but a simulator or rollout mechanism
Limits
Sample-inefficient — usually needs millions of rollouts
Unstable training — exploration / exploitation balance is fragile
Reward design is hard — the agent will exploit whatever you wrote
Sim-to-real gap — policies that work in simulation often fail on hardware
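import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
n_actions = env.action_space.n
Q = np.zeros((env.observation_space.n, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.2

for episode in range(5000):
    s, _ = env.reset()
    done = False
    while not done:
        # ε-greedy: explore with probability eps, otherwise exploit
        if np.random.random() < eps:
            a = env.action_space.sample()
        else:
            a = int(Q[s].argmax())
        s_next, r, terminated, truncated, _ = env.step(a)
        # Terminal states are never updated, so their Q-values stay at zero
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
        # Stop on a terminal state or when the time limit truncates the episode
        done = terminated or truncated

# Greedy policy
policy = Q.argmax(axis=1)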
Want MDPs, Bellman equations, and actor-critic?
Bellman optimality
$$ Q^*(s, a) = \mathbb{E}_{s'}\!\left[\, R(s, a) + \gamma \max_{a'} Q^*(s', a') \,\right] $$
Q*(s, a): optimal action-value, the best possible expected return from state s after taking action a
R(s, a): immediate reward for taking a in s
γ: discount factor (0 = myopic; 1 = far-sighted)
E_{s'}[·]: average over the next state s' drawn from the environment's dynamics
In words. The best possible value of taking an action in a state is whatever reward you collect right now, plus a discounted estimate of the best value you can achieve from wherever you end up next. The Es' averages over the randomness in where you land. The max says you assume you'll act optimally from then on. γ (gamma, between 0 and 1) is the discount — it shrinks rewards that are further in the future, so the agent prefers earlier payoffs. This is a self-referential equation: Q* appears on both sides. Q-learning solves it by repeatedly nudging Q toward the right-hand side as the agent collects experience.
Q*: optimal value of (state, action), the best expected total reward from that point on
reward now: the immediate reward signal from taking that action
γ: discount factor between 0 and 1; higher means a longer planning horizon
avg over next states: average across the random next states the environment can transition to
max over next action: assume you'll pick the best action next time, too
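To make the self-referential nature of the Bellman optimality equation concrete, here is a minimal sketch that applies the right-hand side repeatedly until Q stops changing. It assumes the dynamics are known (unlike Q-learning, which only sees samples), and the 2-state, 2-action MDP is invented purely for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, made up for illustration.
# P[s, a, s'] = probability of landing in s' after taking a in s.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # state 0: action 0 mostly stays, action 1 mostly moves
    [[1.0, 0.0], [0.0, 1.0]],   # state 1: action 0 returns to state 0, action 1 stays
])
R = np.array([
    [0.0, 0.0],                 # no reward in state 0
    [0.0, 1.0],                 # staying in state 1 pays +1
])
gamma = 0.9

def bellman_backup(Q):
    """Right-hand side of the Bellman optimality equation for every (s, a)."""
    best_next = Q.max(axis=1)               # max over a' of Q(s', a'), one entry per s'
    return R + gamma * (P @ best_next)      # expectation over s'

Q = np.zeros((2, 2))
for _ in range(500):
    Q = bellman_backup(Q)

residual = np.abs(Q - bellman_backup(Q)).max()   # close to 0 at the fixed point
greedy_policy = Q.argmax(axis=1)
```

At the fixed point, Q equals its own backup, which is exactly what "Q* appears on both sides" means. Q-learning reaches for the same fixed point using sampled transitions in place of the known P and R.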
MDPs. States S, actions A, transitions P(s' | s, a), rewards R(s, a, s'), discount γ. The agent's job is a policy π that maximises expected discounted return.
Value functions. V^π(s) = expected return from s following π. Q^π(s, a) = expected return from s, taking action a, then following π. Both satisfy Bellman equations; both can be learned.
Model-free vs model-based. Model-free: learn V or Q directly from experience (Q-learning, SARSA, DQN). Model-based: learn the transition model and plan (Dyna-Q, MuZero). Model-based is more sample-efficient when the model is accurate; less so when it isn't.
Policy gradient methods. Parameterise π_θ and take steps in the direction of the gradient of expected return: ∇_θ J = E[Σ_t ∇_θ log π_θ(a_t | s_t) · A(s_t, a_t)], where A is the advantage. REINFORCE is the simplest form; A2C, A3C, PPO add stability.
Actor-critic. Combine policy gradient (actor) with a learned value function (critic). The critic reduces variance; the actor exploits it. Modern RL is mostly actor-critic in some form.
On-policy vs off-policy. On-policy (PPO, A2C): learn from data the current policy collected. Stable but sample-inefficient. Off-policy (DQN, SAC, DDPG): learn from any data (replay buffer). More sample-efficient; trickier to make stable.
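The policy-gradient formula above can be sketched on the smallest possible problem: a one-step, two-armed bandit, where the sum over t has a single term. This is a rough REINFORCE-with-baseline illustration under made-up assumptions (the arm means, learning rate, and running-mean baseline standing in for a critic are all hypothetical choices).

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])    # hypothetical bandit; arm 1 is better
theta = np.zeros(2)                  # policy logits, one per arm
lr = 0.1
baseline = 0.0                       # running mean reward, a crude stand-in for a critic

def softmax(z):
    z = z - z.max()                  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)                 # sample an action from the current policy
    r = rng.normal(true_means[a], 0.1)         # noisy reward
    advantage = r - baseline                   # how much better than average
    grad_log_pi = -probs                       # gradient of log softmax w.r.t. the logits:
    grad_log_pi[a] += 1.0                      #   onehot(a) - probs
    theta += lr * advantage * grad_log_pi      # gradient ascent on expected reward
    baseline += 0.05 * (r - baseline)          # update the running baseline

final_probs = softmax(theta)
```

The data here is always collected by the current policy, which makes this an on-policy method in the sense described above.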
import torch
import torch.nn as nn
import torch.nn.functional as F

# DQN: Q-function approximated by a neural net, replay buffer, target network
class DQN(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, x):
        return self.net(x)

# One training step: returns the loss for a batch sampled from the replay buffer
def dqn_step(net, target_net, batch, gamma=0.99):
    # s, s_next: float [B, state_dim]; a: int64 [B]; r, done: float [B] (done is 0.0 or 1.0)
    s, a, r, s_next, done = batch
    # Q(s, a) for the actions that were actually taken
    q = net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrap from the frozen target network; no bootstrap past a terminal state
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1 - done) * q_next
    return F.smooth_l1_loss(q, target)
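The DQN step above consumes a `batch` but does not show where it comes from. A minimal replay buffer might look like the following; the class name, method names, and uniform sampling are assumptions for illustration, written in pure Python so the sketch stands alone.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO store of (s, a, r, s_next, done) transitions (hypothetical helper)."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)    # oldest transitions fall off automatically
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, float(done)))

    def sample(self, batch_size):
        """Uniformly sample transitions and return five parallel tuples."""
        batch = self.rng.sample(list(self.buffer), batch_size)
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)

# Toy usage with integer "states"; capacity 3 keeps only the last three transitions.
buf = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buf.push(s=t, a=0, r=1.0, s_next=t + 1, done=(t == 4))
s, a, r, s_next, done = buf.sample(batch_size=2)
```

Before passing a sampled batch to `dqn_step`, each tuple would need converting to a tensor of the right dtype (for example with `torch.as_tensor`), and the target network would be refreshed from time to time with `target_net.load_state_dict(net.state_dict())`.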
Want PPO, soft actor-critic, RLHF, and the exploration zoo?
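PPO clipped objective
The explanation that follows talks about a ratio, an advantage, a clip, and a min. These are the pieces of PPO's clipped surrogate objective, which in the usual notation of Schulman et al. (2017) can be written as below. The symbols r_t(θ) for the ratio and Â_t for the advantage estimate are assumed here; they do not appear elsewhere on this page.

$$ L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\, \min\!\big( r_t(\theta)\, \hat{A}_t,\; \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t \big) \,\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} $$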
In words. The ratio is how much more likely the new policy is to take the action than the old policy was. The advantage is how much better than average that action turned out to be. Their product is the natural "policy gradient" term: make good actions more likely, bad actions less. The trick is the clip: it caps the ratio inside a tight band around 1 (typically 0.8 to 1.2), and the min picks the more pessimistic of the two surrogate objectives. The net effect: PPO can move the policy freely while the ratio stays inside the band, but the gradient drops to zero once an update has already pushed the ratio past the band in the direction the advantage favours. That keeps the new policy close to the data-collection policy, which is what makes it stable.
ratio: new policy's probability of the action divided by the old policy's probability
advantage: how much better this action was than the average action at that state
clip(ratio, 1−ε, 1+ε): force the ratio into a narrow band, which prevents huge updates
min(·, ·): pessimistic choice; takes the lower of the clipped and unclipped terms, so the objective never rewards pushing the ratio beyond the band, while a change that makes things worse is still penalised in full
ε: clip width (commonly 0.1 to 0.3)
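The ratio, clip, and min described above can be checked numerically. The NumPy sketch below is an illustration with made-up probabilities and advantages; the function name and the choice to return per-sample terms alongside the mean are assumptions, not part of any library.

```python
import numpy as np

def ppo_clip_terms(logp_new, logp_old, advantages, eps=0.2):
    """Per-sample clipped surrogate terms: min(ratio * A, clip(ratio) * A)."""
    ratio = np.exp(logp_new - logp_old)                       # new prob / old prob
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped)                     # pessimistic choice

# Made-up action probabilities giving ratios 1.5, 1.0, 0.5, 1.5
p_old = np.array([0.2, 0.5, 0.4, 0.2])
p_new = np.array([0.3, 0.5, 0.2, 0.3])
adv = np.array([1.0, 1.0, -1.0, -1.0])

terms = ppo_clip_terms(np.log(p_new), np.log(p_old), adv)
objective = terms.mean()             # quantity to maximise (negate it for a loss)
```

The four samples cover the cases worth knowing: a good action already pushed past the band is capped at 1.2, an unchanged action passes through at 1.0, a bad action already pushed below the band is held at −0.8, and a bad action that became more likely keeps its full −1.5 penalty.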
PPO. Schulman et al. (2017). The de facto default for continuous control. Clipped ratio objective prevents large policy updates; multiple epochs over the same rollouts; advantage estimation via GAE. Practical, simple, robust.
SAC. Haarnoja et al. (2018). Off-policy actor-critic with entropy regularization — encourages exploration by maximising "expected return + policy entropy". Sample-efficient; the right choice for many continuous-action benchmarks.
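MuZero. Schrittwieser et al. (2020). Learn the dynamics model in a latent space and plan with Monte-Carlo tree search. Reaches AlphaZero-level play in Go, chess, and shogi without being given the rules of the game, so no hand-coded simulator is needed. It unifies model-based planning with learned value and policy networks.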
Offline RL. Learn from a fixed dataset without environment access. Conservative Q-Learning (CQL) and Implicit Q-Learning (IQL) keep value estimates cautious about actions the dataset does not cover; Decision Transformer instead reformulates the problem as supervised sequence modelling. Hard because out-of-distribution (OOD) actions can't be evaluated.
RLHF. Reinforcement Learning from Human Feedback. Train a reward model from human preferences over pairs of outputs; optimise an LLM against it with PPO. Made instruction-following LLMs possible. The pretrain → SFT → RLHF pipeline is now standard for assistant-style models.
Exploration. ε-greedy is the floor. Better: entropy bonuses (SAC), intrinsic motivation (curiosity-driven, ICM), Thompson sampling on the Q-distribution (Bootstrapped DQN), random network distillation (RND). The right method depends on the problem's reward sparsity.
The deep-RL stability cottage industry. Modern deep RL is a list of stability tricks: target networks, replay buffers, layer normalization, clipped gradients, learning-rate annealing, normalised observations, advantage normalization, … None alone is magic; together they make things barely work.
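Two items from the list above, GAE (named under PPO) and advantage normalisation (named among the stability tricks), are short enough to sketch directly. This is a minimal single-rollout version under assumed conventions: `dones[t]` is 1.0 when step t ended the episode, and `last_value` is the critic's estimate for the state after the final step. The function names and numbers are illustrative only.

```python
import numpy as np

def gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalised advantage estimates for one rollout of length T."""
    T = len(rewards)
    adv = np.zeros(T)
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(T)):                  # work backward from the last step
        not_done = 1.0 - dones[t]                 # cut the recursion at episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]   # one-step TD error
        adv[t] = delta + gamma * lam * not_done * next_adv
        next_value, next_adv = values[t], adv[t]
    return adv

def normalise(adv, eps=1e-8):
    """Rescale advantages to zero mean and unit standard deviation."""
    return (adv - adv.mean()) / (adv.std() + eps)

# Made-up 3-step episode: reward only at the end, critic values rising toward it.
rewards = np.array([0.0, 0.0, 1.0])
values = np.array([0.2, 0.5, 0.8])
dones = np.array([0.0, 0.0, 1.0])

adv = gae(rewards, values, dones, last_value=0.0, gamma=1.0, lam=1.0)
adv_norm = normalise(adv)
```

With γ = λ = 1 the estimate reduces to "actual return minus predicted value", which makes the result easy to verify by hand; lowering λ leans more on the critic and trades variance for bias.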