Find the points that don't look like the rest — fraud, defects, intrusions, the rare and weird.
Key idea
Learn what "normal" looks like; flag what doesn't. Anomaly detection is unsupervised — you usually don't have labelled anomalies, just the assumption that most of your data is normal and a few rare points are not.
[Interactive figure: slide the threshold; every point below the cutoff density is flagged in terracotta. Settings shown: τ = 0.25, h = 0.12.]
The cream-to-indigo heatmap is a kernel density estimate of "where normal lives". Points are flagged as anomalies when their local density score drops below the threshold τ. Lower the bandwidth h and the model becomes sensitive to local quirks (it over-fits); raise it and only the most isolated points stand out. The Ring dataset is a classic: the genuinely anomalous points live inside the ring, where methods that measure distance from the centre of the data would happily call them "central and normal".
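The thresholding described above can be sketched with scikit-learn's `KernelDensity`. This is a minimal illustration, not the code behind the figure: the ring data is synthetic, the KDE is fitted only on points assumed to be normal, and rescaling the density to [0, 1] before comparing with τ is an assumption about how the slider is normalised.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Hypothetical ring data: 300 normal points on a noisy unit circle,
# plus 5 anomalies near the centre of the ring.
angles = rng.uniform(0, 2 * np.pi, 300)
ring = np.c_[np.cos(angles), np.sin(angles)] + rng.normal(0, 0.05, (300, 2))
centre = rng.normal(0, 0.05, (5, 2))
X = np.vstack([ring, centre])

h = 0.12    # kernel bandwidth
tau = 0.25  # threshold on the rescaled density score

# Fit the density model on data assumed to be normal.
kde = KernelDensity(kernel="gaussian", bandwidth=h).fit(ring)
density = np.exp(kde.score_samples(X))  # score_samples returns log-density
score = density / density.max()         # rescale to [0, 1] (assumption)
flagged = score < tau

print("flagged ring points:  ", int(flagged[:300].sum()))
print("flagged centre points:", int(flagged[300:].sum()))
```

The centre points sit at the mean of the data, so a distance-from-centre rule would rank them as the most normal points, while the density score puts them close to zero.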
Three classical strategies. Density-based: a point with low probability under a model of "normal" is anomalous. Distance-based: a point far from its nearest neighbours is anomalous. Reconstruction-based: train a model to compress and reconstruct normal data; points it reconstructs badly are anomalous.
The right choice depends on what "anomalous" means in your domain: a fraudulent transaction looks different from a manufacturing defect, which in turn looks different from a network intrusion.
Reach for it when
Fraud / intrusion / defect detection
You have plenty of normal data but few or no labelled anomalies
Monitoring sensor data for unusual patterns
Cleaning a dataset of outliers before modelling
Skip it when
You have labels for both classes — train a regular classifier (with class weights)
Anomalies are common enough to balance — it's just classification
"Anomalous" isn't well-defined and changes over time
You need to explain why a specific point was flagged
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.05, random_state=0).fit(X_train)
# -1 = anomaly, 1 = normal
labels = iso.predict(X_test)
scores = iso.score_samples(X_test) # lower = more anomalous
In words. Three ways to assign a "weirdness" score to a point x. Density: how (im)probable is x under a model of normal data? −log p(x) is the "surprise": bigger means rarer. Distance: how far is x from its k nearest neighbours (written NN_k(x))? Lone points are anomalous. Reconstruction: train a compressor on normal data; if it can't faithfully reconstruct x, then x doesn't look normal. Each strategy picks a different definition of "weird"; pick the one that matches how your anomalies actually behave.
density: negative log probability under a model of normal data (bigger = rarer)
distance: how far x is from its k nearest neighbours
reconstruction: how badly an autoencoder reconstructs x
Each strategy gives a continuous score; threshold to decide normal vs. anomalous
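The three scores can be computed side by side on toy data. This is a rough sketch under stated assumptions: the data is synthetic, a single Gaussian stands in for the "model of normal data", and PCA stands in for the autoencoder (it is a linear compressor, so the reconstruction idea is the same).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical data: normal points lie near a line in 3-D.
t = rng.normal(size=(500, 1))
X_normal = t @ np.array([[1.0, 2.0, -1.0]]) + rng.normal(0, 0.1, (500, 3))
x_typical = np.array([[0.5, 1.0, -0.5]])   # on the line
x_odd = np.array([[2.0, -2.0, 2.0]])       # far off the line
X_query = np.vstack([x_typical, x_odd])

# Density: negative log-likelihood under a Gaussian fitted to normal data.
gm = GaussianMixture(n_components=1, random_state=0).fit(X_normal)
density_score = -gm.score_samples(X_query)

# Distance: mean distance to the k nearest normal points.
k = 5
nn = NearestNeighbors(n_neighbors=k).fit(X_normal)
dist, _ = nn.kneighbors(X_query)
distance_score = dist.mean(axis=1)

# Reconstruction: squared error after compressing to 1 component and back.
pca = PCA(n_components=1).fit(X_normal)
recon = pca.inverse_transform(pca.transform(X_query))
reconstruction_score = ((X_query - recon) ** 2).sum(axis=1)

for name, s in [("density", density_score),
                ("distance", distance_score),
                ("reconstruction", reconstruction_score)]:
    print(f"{name:>14}: typical={s[0]:.3f}  odd={s[1]:.3f}")
```

All three scores are on different scales, which is why each needs its own threshold.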
Isolation Forest. Builds random trees that split randomly chosen features at random thresholds. Anomalies, sitting in sparse regions, get isolated in fewer splits. The score is derived from the average path length to isolation across the trees: shorter paths mean more anomalous. Cheap, scales well, and works in moderate dimensions. A sensible default for tabular anomaly detection.
One-Class SVM. Fits a decision boundary around the "normal" data in a kernel feature space, treating the origin as the "anomaly side". Good with small data and a sensible kernel; doesn't scale to big data.
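A small One-Class SVM example using scikit-learn's `OneClassSVM`. The data and the choices `nu=0.05` and `gamma="scale"` are illustrative assumptions; `nu` roughly controls the fraction of training points allowed to fall outside the boundary.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (300, 2))           # hypothetical normal data
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])    # one typical point, one far away

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)
pred = ocsvm.predict(X_new)                # 1 = inside the boundary, -1 = outside
margin = ocsvm.decision_function(X_new)    # negative = outside the boundary

print(pred, margin)
```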
Local Outlier Factor (LOF). Compares each point's local density to its neighbours' local densities. Catches anomalies in heterogeneous-density data that global methods miss. Score > 1 means lower density than neighbours.
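To see the heterogeneous-density case concretely, here is a sketch with scikit-learn's `LocalOutlierFactor` on made-up data: a tight cluster, a loose cluster, and one point sitting just outside the tight cluster. The cluster shapes and `n_neighbors=20` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Hypothetical data with two very different densities.
tight = rng.normal(0.0, 0.1, (200, 2))
loose = rng.normal(5.0, 1.5, (200, 2))
outlier = np.array([[0.8, 0.8]])           # close in absolute terms, far for the tight cluster
X = np.vstack([tight, loose, outlier])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                # -1 = outlier, 1 = inlier
lof_score = -lof.negative_outlier_factor_  # about 1 for inliers, larger for outliers

print("LOF of the planted point:", lof_score[-1])
print("median LOF:", np.median(lof_score))
```

The planted point is closer to its neighbours than many loose-cluster points are to theirs, so a single global distance cutoff would struggle here; LOF flags it because its neighbours are much more tightly packed than it is.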
Autoencoder reconstruction. Train an autoencoder on normal data; at inference time, flag points with high reconstruction error. Scales to images and high-dim data where other methods struggle.
Threshold setting. All methods give continuous scores; the threshold is a business decision (precision vs. recall trade-off). With no labels, use a quantile of training scores; with some labels, calibrate against the validation set.
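Both threshold recipes can be sketched in a few lines. The data below is synthetic, the 99th percentile is an arbitrary example cutoff, and picking the best-F1 threshold is just one possible way to "calibrate against the validation set"; the right target depends on the cost of false positives versus misses.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (1000, 4))                  # hypothetical, assumed mostly normal
X_val = np.vstack([rng.normal(0, 1, (200, 4)),
                   rng.normal(5, 1, (10, 4))])          # hypothetical labelled validation set
y_val = np.r_[np.zeros(200, dtype=int), np.ones(10, dtype=int)]  # 1 = anomaly

iso = IsolationForest(random_state=0).fit(X_train)
# Negate so that higher = more anomalous.
train_scores = -iso.score_samples(X_train)
val_scores = -iso.score_samples(X_val)

# No labels: flag anything above the 99th percentile of training scores.
thr_quantile = np.quantile(train_scores, 0.99)
flag_quantile = val_scores > thr_quantile

# Some labels: pick the validation threshold with the best F1.
precision, recall, thresholds = precision_recall_curve(y_val, val_scores)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
thr_f1 = thresholds[np.argmax(f1[:-1])]   # the last precision/recall pair has no threshold
flag_f1 = val_scores >= thr_f1

print("quantile threshold:", thr_quantile, "flags", int(flag_quantile.sum()))
print("best-F1 threshold: ", thr_f1, "flags", int(flag_f1.sum()))
```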
In words. Anomalies come in three nested types of increasing subtlety. A point anomaly is a single observation that's odd regardless of anything around it (a credit card charge in a country you've never visited). A contextual anomaly is odd only in a particular context: 25°C is normal in summer and anomalous in winter, so you need to know the season to flag it. A collective anomaly requires looking at a whole group together: no single heart-rate reading is unusual, but the whole sequence drifting upward over an hour is. Each type subsumes the previous and is harder to detect.
point: one observation odd on its own
contextual: odd given the context (time of day, location, season)
collective: a sequence or group is anomalous as a whole, even if each piece looks normal
Each type is harder to detect than the previous
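The temperature example can be made concrete with plain NumPy. The readings are simulated, and a per-season z-score is only one simple way to condition on context; it is shown here to contrast with a global z-score that ignores the season.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily temperatures (deg C) with a season label per reading.
season = np.array(["summer"] * 200 + ["winter"] * 200)
temp = np.r_[rng.normal(25, 3, 200), rng.normal(2, 3, 200)]
temp[-1] = 25.0   # a 25 degree day in winter

# Point view: z-score against all readings, ignoring context.
z_global = np.abs(temp - temp.mean()) / temp.std()

# Contextual view: z-score against readings from the same season.
z_context = np.empty_like(temp)
for s in np.unique(season):
    m = season == s
    z_context[m] = np.abs(temp[m] - temp[m].mean()) / temp[m].std()

print("global z-score:    ", z_global[-1])
print("contextual z-score:", z_context[-1])
```

The same reading looks unremarkable globally and extreme within its season, which is the difference between a point detector and a contextual one.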
Deep one-class methods. Deep SVDD (Ruff et al.) replaces the kernel feature map with a learned neural network and shrinks the data into a small hypersphere. Trained end-to-end; works for images and time series. Watch for representation collapse (every input maps to the centre).
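A stripped-down sketch of the one-class Deep SVDD objective in PyTorch follows. It is a simplification under stated assumptions: the data is synthetic, the network is tiny, there is no autoencoder pretraining, and the centre c is fixed at the mean initial embedding. The layers omit bias terms because a network with biases can ignore its input and output c for everything, which is one route to the collapse mentioned above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X_normal = torch.randn(512, 8)                   # hypothetical normal data
X_test = torch.cat([torch.randn(5, 8),           # 5 typical points
                    torch.randn(5, 8) + 6.0])    # 5 shifted points

# Bias-free layers: with biases, the network can output c regardless of input.
net = nn.Sequential(
    nn.Linear(8, 32, bias=False), nn.LeakyReLU(0.1),
    nn.Linear(32, 8, bias=False),
)

# Fix the centre c at the mean initial embedding; it is not trained.
with torch.no_grad():
    c = net(X_normal).mean(dim=0)

opt = torch.optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-6)
losses = []
for _ in range(100):
    opt.zero_grad()
    loss = ((net(X_normal) - c) ** 2).sum(dim=1).mean()  # mean squared distance to c
    loss.backward()
    opt.step()
    losses.append(loss.item())

# Anomaly score = squared distance to c in the learned embedding space.
with torch.no_grad():
    score = ((net(X_test) - c) ** 2).sum(dim=1)

print("typical:", score[:5].mean().item(), " shifted:", score[5:].mean().item())
```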
Generative anomaly detection. Train a generative model (GAN, normalizing flow, diffusion) on normal data; anomalies show up as low likelihood or poor reconstructions. These methods are among the strongest on industrial defect benchmarks such as MVTec AD. Caveat: deep generative models do not reliably assign low likelihood to out-of-distribution inputs; see Nalisnick et al. 2019.
Sequence and time-series. Forecast-based methods flag points where the prediction error exceeds a threshold (ARIMA residuals, Prophet, deep forecast models). Reconstruction-based methods (LSTM autoencoders, transformer denoisers) work for collective anomalies — flag a window whose reconstruction is poor.
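The forecast-residual idea can be illustrated without any forecasting library. In this sketch a moving average of the previous `w` points stands in for ARIMA or a deep forecaster, the signal is synthetic, and the cutoff of 6 robust standard deviations is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical signal: a noisy sine wave with one injected spike.
n = 500
t = np.arange(n)
y = np.sin(2 * np.pi * t / 50) + rng.normal(0, 0.1, n)
y[300] += 5.0

# One-step-ahead forecast: mean of the previous w observations.
w = 5
forecast = np.convolve(y, np.ones(w) / w, mode="valid")[:-1]  # forecasts for y[w:]
resid = y[w:] - forecast

# Robust residual scale from the median absolute deviation (MAD).
centred = resid - np.median(resid)
mad = np.median(np.abs(centred))
z = np.abs(centred) / (1.4826 * mad)

flagged_idx = np.where(z > 6)[0] + w   # shift back to the original time index
print(flagged_idx)
```

A median-based scale is used so that the spike itself does not inflate the estimate of normal residual size.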
Calibration. Anomaly scores are not probabilities. Convert them via a percentile-based mapping or by fitting a tail distribution (Generalized Pareto). PR-AUC is the better summary metric in the heavily imbalanced regime; ROC-AUC can look high even when most flagged points are false positives.
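A quick numerical illustration of both points, on simulated scores with 0.2% anomalies. `average_precision_score` is used as a common estimate of PR-AUC, and `to_percentile` is a hypothetical helper name for the percentile mapping.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical scores: 10,000 normal points and 20 anomalies.
s_normal = rng.normal(0, 1, 10_000)
s_anom = rng.normal(2.5, 1, 20)
scores = np.r_[s_normal, s_anom]
y = np.r_[np.zeros(10_000, dtype=int), np.ones(20, dtype=int)]

roc = roc_auc_score(y, scores)
pr = average_precision_score(y, scores)
print(f"ROC-AUC = {roc:.3f}, PR-AUC (average precision) = {pr:.3f}")

# Percentile mapping: fraction of reference (normal) scores at or below s.
ref = np.sort(s_normal)

def to_percentile(s):
    """Map raw scores to empirical percentiles of the reference scores."""
    return np.searchsorted(ref, s, side="right") / len(ref)

p = to_percentile(np.array([-10.0, 0.0, 10.0]))
print(p)
```

The percentile output says how unusual a score is relative to normal data; it is still not the probability that a point is an anomaly.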
Evaluation pitfalls. Without labels, you're estimating performance from synthetic anomalies — which often don't match real ones. With labels, watch out for label leakage from the threshold-setting process. Always evaluate on a held-out period for time-series.
Reach for it when
Deep SVDD: images, learned representations, end-to-end pipeline
Normalizing flows: need explicit likelihoods, not just scores
LSTM / transformer reconstruction: sequential data with structure
Density-ratio: compare against a known reference distribution
Skip it when
You truly have labels — use supervised methods with class weights / focal loss
"Normal" is multi-modal and rare — single-class methods overfit one mode
Anomalies must be human-interpretable — deep methods are opaque
You can't retrain regularly and the data drifts
import torch
import torch.nn as nn


class AutoencoderAD(nn.Module):
    """Small fully connected autoencoder with a 16-dimensional bottleneck."""

    def __init__(self, d_in):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, 16))
        self.dec = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, d_in))

    def forward(self, x):
        return self.dec(self.enc(x))


# Train ONLY on normal data (X_normal: float tensor of shape [n, d_in])
model = AutoencoderAD(d_in=X_normal.shape[1])
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):  # full-batch gradient steps
    opt.zero_grad()
    loss = ((model(X_normal) - X_normal) ** 2).mean()
    loss.backward()
    opt.step()

# Anomaly score = per-sample reconstruction error
model.eval()
with torch.no_grad():
    train_err = ((model(X_normal) - X_normal) ** 2).mean(dim=1)
    err = ((model(X_test) - X_test) ** 2).mean(dim=1)

# Threshold = 99th percentile of errors on the normal training data,
# so no test labels are needed (and none leak into the threshold)
threshold = train_err.quantile(0.99)
flagged = err > threshold