Hidden Markov Models — Layerwise ML

Mode

Key idea

A system you can't see directly, but you can see its outputs. The state changes over time according to its own rules. You only observe noisy emissions. Given the emissions, you infer the hidden states.

Click an activity icon to flip it — the hidden-state posterior and Viterbi path re-shuffle to explain the new evidence

Sequence

The toy model has three hidden weather states (Sunny / Cloudy / Rainy) that you never observe directly — only the activity I picked that day (Walk / Shop / Clean). Each state has its own emission preference, and they tend to persist (Sunny is more likely to stay Sunny than flip straight to Rainy). The three colour-tinted rows show the forward-backward posterior: how strongly the model believes each state was active given the entire sequence. The dark line is the Viterbi most-likely path — the single best explanation rather than per-step marginals.

Classic example: speech recognition. The "hidden state" is the phoneme being spoken; the "observation" is the audio signal. You don't see the phoneme directly, only the sound. The model captures (a) which phonemes tend to follow which (transition probabilities) and (b) what each phoneme tends to sound like (emission probabilities).

HMMs have been largely replaced by RNNs and transformers in speech and NLP, but they're still the right tool when (a) the latent states have a real interpretation you care about, (b) data is small, or (c) you want exact inference rather than approximate.

Reach for it when

The hidden states correspond to real-world categories you care about
Small data — exact inference is feasible and helps
Biological sequence analysis (gene finding, protein structure)
You need a probabilistic story you can interrogate

Skip it when

Lots of data and you don't care about latent-state interpretability — use RNN / transformer
State transitions depend on long-range context (Markov assumption breaks)
Emissions are high-dimensional images or text
You want end-to-end learning of representations

from hmmlearn import hmm

# 3 hidden states, Gaussian emissions
model = hmm.GaussianHMM(n_components=3, covariance_type="full")
model.fit(observations)   # shape: (n_samples, n_features)

# Decode: most likely hidden-state sequence (Viterbi)
hidden_states = model.predict(observations)

# Score: log-likelihood of the observed sequence under the model
print(f"log p(obs) = {model.score(observations):.2f}")

Want the three classical problems and the algorithms?

The model

$$ p(\mathbf{x}_{1:T}, \mathbf{z}_{1:T}) \;=\; \pi(z_1)\,\prod_{t=2}^{T} A(z_t \mid z_{t-1}) \,\prod_{t=1}^{T} B(x_t \mid z_t) $$

πinitial state distribution
Astate transition matrix
Bemission distribution (Gaussian, multinomial, …)
zthidden state at time t
xtobserved emission at time t

$$ p(\text{observations}, \text{hidden states}) \;=\; \underbrace{P(z_1)}_{\text{start}} \;\times\; \prod_t \underbrace{P(z_t \mid z_{t-1})}_{\text{transition}} \;\times\; \prod_t \underbrace{P(x_t \mid z_t)}_{\text{emission}} $$

In words. The probability of seeing a particular sequence of observations together with a particular sequence of hidden states factors into three pieces. Start with π (pi) — the probability of whatever hidden state you happen to begin in. Then for every following step, multiply in the transition probability A(zt | zt-1) — how likely it is to move to the new hidden state given the previous one. Finally, for every step, multiply in the emission probability B(xt | zt) — how likely the observation you saw is given the hidden state you were in. The two big products (∏) just say "multiply across all time steps." The whole expression is built on the Markov assumption: the future depends on the present only, not the past.

observationswhat you actually see, x₁, x₂, …, x_T
hidden stateswhat's really going on underneath, z₁, z₂, …, z_T
startπ — initial state distribution
transitionA — probability of moving from one hidden state to the next
emissionB — probability of producing each observation from a hidden state

Three classical problems, each with its own algorithm:

1. Evaluation — given a model, how likely is an observed sequence? Use the forward algorithm. Dynamic programming, O(T·K²) where K is the number of states.

2. Decoding — given a model and observations, what's the most likely hidden state sequence? Use the Viterbi algorithm. Same complexity as forward, but tracks the argmax path.

3. Learning — given observations only, fit the model parameters. Use the Baum-Welch algorithm (EM specialized to HMMs). Iterate: E-step computes posterior over latent states via forward-backward; M-step updates π, A, B.

Without the Markov assumption, all of these would be exponential in T. The local conditioning is what makes HMMs tractable.

Reach for it when

Tagging tasks where labels follow a Markov chain (POS tagging, NER)
Acoustic modelling for small-vocabulary speech
Bioinformatics (CpG islands, gene structure, profile HMMs)
Activity recognition from sensor sequences

Skip it when

Long-range dependencies — RNN / transformer captures them better
Continuous high-dim emissions — better handled by deep generative models
You need to learn the state space (not just its parameters)
Labels are observed at every step — use CRF / sequence tagger instead

from hmmlearn import hmm
import numpy as np

# Multinomial emissions for discrete observations
model = hmm.CategoricalHMM(n_components=3, n_iter=100, random_state=0)
model.fit(X, lengths=lengths)    # lengths splits concatenated sequences

# Posterior probability of each state at each timestep
log_prob, posteriors = model.score_samples(X)

# Viterbi: most likely state sequence
states = model.predict(X)

print("Transition matrix:")
print(np.round(model.transmat_, 3))

Want forward-backward, max-product, and the CRF connection?

Forward recursion

$$ \alpha_t(j) \;=\; B(x_t \mid z_t{=}j) \;\sum_{i=1}^{K} \alpha_{t-1}(i)\, A(i \to j) $$

αt(j)joint probability of observations 1..t and hidden state j at t
Combined with the analogous backward recursion β_t(j), you get the full posterior γ_t(j) at each step

$$ \alpha_t(j) \;=\; P(\text{observation at } t \mid \text{state } j) \;\times\; \sum_{i} \alpha_{t-1}(i) \cdot P(\text{move } i \to j) $$

In words. αt(j) (alpha — Greek letter used here as a name for this quantity) is the joint probability of having seen observations 1 through t and being in hidden state j at time t. You build it step by step. To get to state j at time t, you must have been in some state i at the previous step — so the Σ ("sum over all possible previous states i") adds up every way of arriving. Multiply by the transition probability of that move, then by the emission probability of seeing xt given you're now in state j. Running this recursion across t = 1, 2, …, T costs O(T·K²) — much cheaper than the exponential cost of enumerating every hidden path. The analogous backward recursion sweeps right-to-left, and together they give you the per-step posterior over states.

αt(j)probability of the first t observations and ending in state j at time t
Knumber of hidden states
icandidate previous hidden state — the sum runs over all of them
move i → jtransition from state i to state j (one entry of the matrix A)
Cost is O(T·K²): for each time step, for each pair of states

Sum-product vs max-product. Forward-backward and Viterbi are the same dynamic program with different semiring operations: forward sums over paths to compute p(x); Viterbi maxes over paths to find the best one. Both are O(T·K²) and both run in log-space in practice to avoid underflow.

Baum-Welch. The EM algorithm for HMMs. The E-step uses forward-backward to compute expected state and transition counts; the M-step normalizes those counts to update π, A, and the emission parameters. Local optima are common; warm-start with k-means or supervised pre-training if you have it.

HMMs vs. CRFs. HMMs are generative (model p(x, z)); CRFs are discriminative (model p(z | x) directly). CRFs win when you can engineer good features over the observations and have labels for training. HMMs win when labels are scarce or you want to do unsupervised structure discovery.

Generalizations. HSMMs (hidden semi-Markov) allow non-geometric state durations. Factorial HMMs decompose the state into multiple chains. Input-output HMMs condition transitions and emissions on external inputs. Beyond that, you've reinvented an RNN — at which point you probably should just use one.

Reach for it when

You need exact inference and have computational budget for it
Latent structure interpretation is the goal (e.g. regime detection)
Probabilistic outputs feed downstream Bayesian computation
Domain knowledge dictates a specific transition structure

Skip it when

You can afford to label sequences — train a sequence tagger / transformer instead
The state space is unknown and continuous
You need to model variable-rate sequences with strong dependencies
You want to backprop through the model end-to-end (use a differentiable variant)

import numpy as np

def forward_log(log_pi, log_A, log_B):
    """Forward in log-space. log_B has shape (T, K) — log emission probs."""
    T, K = log_B.shape
    log_alpha = np.empty_like(log_B)
    log_alpha[0] = log_pi + log_B[0]
    for t in range(1, T):
        # logsumexp trick to stay numerically stable
        log_alpha[t] = log_B[t] + _logsumexp(log_alpha[t-1, :, None] + log_A, axis=0)
    return log_alpha

def _logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

Too dense?