Gaussian Processes — Layerwise ML

Mode

Key idea

Instead of one guess, give me a range. A Gaussian process predicts both a value and how confident it is about that value. Where you have lots of nearby data, it's confident; where you don't, it widens its uncertainty.

Click to add observations · shift+click to remove · slide ℓ to change how far each point's influence reaches

Kernel ℓ σₙ ℓ = 0.25 σₙ = 0.05

The bold line is the posterior mean; the shaded band is the ±2σ uncertainty. The thin lines are four sample functions drawn from the posterior — they show what kinds of functions are still consistent with your data. Clear all observations and you're seeing the prior — the GP's belief before any data. Add a point and watch the band collapse to it (within σₙ); the uncertainty inflates the further you go.

Most regression models give you one number per prediction. A Gaussian process gives you a number and an error bar — and the error bar is data-aware: it shrinks near your training points and grows when you predict far away from anything you've seen.

That uncertainty is the whole point. GPs are slow and don't scale to big data, but when you have a few hundred points and you care about knowing what you don't know — design of experiments, Bayesian optimization, robotics — they're the right tool.

Reach for it when

You have small data (≲1000 points)
Calibrated uncertainty matters
You want a non-parametric model that adapts to the data
You're doing Bayesian optimization

Skip it when

You have a lot of data (GPs are O(N³))
You only need point predictions, not uncertainty
The signal is well-behaved and a simpler model would do
You're working with images or text (use neural nets)

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gp.fit(X_train, y_train)

# Predict with uncertainty
y_mean, y_std = gp.predict(X_test, return_std=True)

Want to see what's actually happening under the hood?

Key idea

$$ f(\mathbf{x}) \;\sim\; \mathcal{GP}\!\left(m(\mathbf{x}),\; k(\mathbf{x}, \mathbf{x}')\right) $$

m(x)mean function (often zero)
k(x,x')kernel — encodes how similar outputs should be when inputs are similar

$$ \text{unknown function } f \;\sim\; \text{GP}(\text{mean function},\; \text{kernel}) $$

In words. Read the ∼ as "is drawn from" — a Gaussian process is a probability distribution over functions, and f is a single sample from it. Two ingredients describe the distribution. The mean function m(x) is what the function looks like on average at input x (people usually set this to zero and let the data do the work). The kernel k(x, x') is a similarity score between two inputs: it controls how correlated the function's outputs are when the inputs are close. A smooth kernel produces smooth functions; a periodic kernel produces wavy ones. The whole machinery rests on one property — any finite set of function values you ask about is a jointly Gaussian random vector, which is exactly what makes the math close-form-tractable.

fthe unknown function you're trying to learn — a single draw from the GP
mean functionthe GP's prior expectation of f(x); usually zero
kernela similarity function — nearby inputs → highly correlated outputs
∼"is drawn from" — f is one sample from a distribution over functions

A GP is a distribution over functions such that any finite set of function values is jointly Gaussian. The kernel determines what kind of functions are likely — smooth, periodic, rough, etc.

The kernel k is the heart of the model. The most common one — the RBF kernel k(x, x') = exp(−‖x − x'‖² / 2ℓ²) — assumes smooth functions; nearby inputs produce similar outputs, with "nearby" controlled by the length scale ℓ.

Given training data, the posterior over f(x_*) at a test point is also Gaussian, in closed form:

μ = k^T(K + σ²I)⁻¹y σ² = k(x,x) − k^T(K + σ²I)⁻¹k_*

That matrix inverse is the source of the O(N³) cost — it's what kills GPs on large data.

Reach for it when

Small to medium data with strong prior beliefs about smoothness
You want principled hyperparameter selection via marginal likelihood
Active learning / Bayesian optimization (need uncertainty to pick next point)
Time-series with seasonality (composable kernels)

Skip it when

N > a few thousand — use sparse approximations or switch models
Functions are non-stationary in ways the kernel can't capture
You have categorical features (need custom kernels)
Output is high-dimensional (vector-valued GPs are awkward)

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel as C

# Composable kernel: amplitude * RBF + noise
kernel = C(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)

gp = GaussianProcessRegressor(
    kernel=kernel,
    n_restarts_optimizer=10,   # random restarts for marginal-likelihood optimization
    normalize_y=True,
).fit(X_train, y_train)

print("Fitted kernel:", gp.kernel_)
print("Log-marginal likelihood:", gp.log_marginal_likelihood(gp.kernel_.theta))

y_mean, y_std = gp.predict(X_test, return_std=True)

Want sparse approximations and the kernel-engineering tradeoffs?

Marginal likelihood (hyperparameter learning)

$$ \log p(\mathbf{y} \mid X, \boldsymbol{\theta}) \;=\; -\tfrac{1}{2}\mathbf{y}^\top K_\theta^{-1}\mathbf{y} \;-\; \tfrac{1}{2}\log\!|K_\theta| \;-\; \tfrac{n}{2}\log 2\pi $$

Kθcovariance matrix induced by kernel hyperparameters θ
fit term — yTK-1y penalizes poor data fit
complexity term — log|K| penalizes overly-flexible kernels

$$ \log p(\text{targets} \mid \text{kernel params}) \;=\; -\tfrac{1}{2} \underbrace{\mathbf{y}^\top K^{-1} \mathbf{y}}_{\text{data fit}} \;-\; \tfrac{1}{2} \underbrace{\log |K|}_{\text{complexity}} \;-\; \tfrac{n}{2} \log 2\pi $$

In words. This is the log-probability of the observed targets y under the GP, with the function f integrated away — leaving only the kernel hyperparameters θ (length scale, amplitude, noise) to tune. Three terms. The data-fit term yTK−1y punishes you when the targets don't lie comfortably in the kinds of functions the kernel allows — K−1 is the inverse of the n × n kernel matrix at the training inputs. The complexity term log|K| (the log-determinant) punishes overly flexible kernels — a richer kernel matches anything but spreads probability thin. The final n log 2π is a normalisation constant that doesn't depend on θ. Maximise this whole expression over θ and you get a built-in Occam's razor — fit vs. complexity, balanced automatically.

targetstraining labels y
kernel paramshyperparameters θ like length scale, amplitude, noise
Kkernel matrix evaluated at every pair of training inputs
K−1matrix inverse — the O(N³) cost lives here
log |K|log-determinant of K — the complexity penalty
nnumber of training points

The log marginal likelihood automatically trades off fit vs. complexity — a built-in Occam's razor. Optimize it w.r.t. kernel hyperparameters to set length scales, amplitudes, and noise simultaneously.

Scaling. The cubic cost of inverting the N×N kernel matrix is the headline limitation. Sparse approximations represent the GP via m ≪ N inducing points and reduce cost to O(Nm²). FITC and DTC are early variants; SVGP (variational, Hensman et al.) is the modern workhorse and scales to millions of points via stochastic optimization.

Kernel composition. Sums and products of valid kernels are valid kernels. This lets you encode structure: RBF + Periodic for time series with seasonality, RBF × Linear for trends, etc. Automatic kernel discovery (Duvenaud et al.) tries to compose these structures from data.

Connection to neural networks. An infinitely-wide neural network with random Gaussian weights converges to a GP (Neal, 1996). The NTK (neural tangent kernel) describes the GP that trained networks effectively act as in the infinite-width limit — connecting GPs to deep learning theory.

Likelihood freedom. The closed-form posterior requires Gaussian likelihood. For classification or other non-Gaussian likelihoods, approximate inference is needed: Laplace, EP (expectation propagation), or variational methods.

Reach for it when

Bayesian optimization / experimental design
Spatial statistics (kriging is a GP)
You want interpretable hyperparameters (length scale = "how far is similar?")
Multi-fidelity modelling — cheap and expensive evaluations

Skip it when

You can't choose a sensible kernel (no inductive bias)
You need extrapolation — RBF kernels decay to the prior mean
Strict inference latency — sparse GPs are still slower than NNs
Heavy-tailed noise — Gaussian likelihood is too sensitive to outliers

import gpytorch, torch

class ExactGP(gpytorch.models.ExactGP):
    def __init__(self, X, y, likelihood):
        super().__init__(X, y, likelihood)
        self.mean   = gpytorch.means.ConstantMean()
        self.cov    = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(self.mean(x), self.cov(x))

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model      = ExactGP(X_train, y_train, likelihood)
mll        = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

# Optimize marginal likelihood via Adam
model.train(); likelihood.train()
opt = torch.optim.Adam(model.parameters(), lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = -mll(model(X_train), y_train)
    loss.backward()
    opt.step()

Too dense?