An ensemble of decorrelated decision trees, averaged or voted.
Key idea
Ask a crowd of decision trees, then take the majority vote. Each tree alone is shaky and makes its own mistakes, but the trees disagree in different places — so when you average their answers, the errors cancel out and the agreement points to the truth.
Each tree was trained on a different bootstrap sample of the data, with a random feature subset at every split, so each one is brittle and wrong in its own way. The ensemble boundary is the average of their votes: the individual errors cancel and the agreement reinforces. As the number of trees N grows from 1 to 80, the boundary goes from jagged and overconfident to smooth and stable.
A decision tree is a flowchart of yes/no questions ("Is age > 30?" → "Is income > 50k?" → …) that ends in a prediction. It's intuitive, but it's also brittle: small changes in your training data can produce a wildly different tree.
A random forest builds many such trees, each on a slightly different sample of your data, and lets them vote. It's a bit like asking 100 doctors for a second opinion instead of trusting one — the consensus is more reliable than any individual.
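The crowd-vote intuition can be checked with a small simulation. This is only a sketch under a strong assumption that real trees do not satisfy exactly: each hypothetical voter is right with probability 0.6, independently of the others. The names `majority_vote_accuracy`, `p_correct`, and `n_trials` are made up for illustration.

```python
import random

def majority_vote_accuracy(n_voters, p_correct=0.6, n_trials=5000, seed=0):
    """Fraction of trials in which a majority of independent voters is right."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Count how many voters get this trial right.
        correct = sum(rng.random() < p_correct for _ in range(n_voters))
        if correct > n_voters / 2:
            wins += 1
    return wins / n_trials

single = majority_vote_accuracy(1)
crowd = majority_vote_accuracy(101)
print(f"1 voter: {single:.2f}   101 voters: {crowd:.2f}")
```

Real trees trained on overlapping data are correlated, so the gain from voting is smaller than in this idealized setting. That gap is why a forest works to make its trees disagree.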
Reach for it when
Your data is in rows and columns (a spreadsheet)
You want something that "just works" with minimal setup
You're not sure where to start
You want a reasonable accuracy estimate without extra effort
Skip it when
You need to explain why a specific prediction came out a certain way
Your data is images, text, or sequences (use neural networks)
You need predictions outside the range you trained on
from sklearn.ensemble import RandomForestClassifier
# Train: just give it labelled data
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
# Predict
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}")
$$ \underbrace{\hat{f}(x)}_{\text{forest prediction}} \;=\; \frac{1}{B} \sum_{b=1}^{B} \underbrace{T_b(x)}_{\text{tree predictions}} $$
In words. Ask every tree what it thinks about input x, then take the average of their answers (for classification, that's a majority vote; for regression, a numeric mean). B is how many trees you have, usually a few hundred. Each individual tree $T_b$ was trained on a bootstrap re-sample of the data while only considering a random subset of features at every split, so the trees are deliberately decorrelated. Averaging cancels out their independent errors while preserving the signal they all agree on.
forest prediction: the ensemble's final answer at input x
tree predictions: each individual tree's vote or value for x
B: number of trees in the forest (typically 100–500)
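The averaging step can be reproduced by hand. The sketch below assumes scikit-learn and NumPy are installed and uses synthetic data from `make_classification` as a stand-in for a real dataset. It relies on the documented behaviour that scikit-learn's `RandomForestClassifier.predict_proba` returns the mean of the per-tree class probabilities (a soft vote rather than a count of hard votes).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for a real dataset (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Collect every tree's class-probability estimate, then average over trees.
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

matches = np.allclose(manual_average, forest.predict_proba(X))
print("Forest output equals the per-tree average:", matches)
```

Here `forest.estimators_` plays the role of the trees $T_1, \dots, T_B$, with B = 50.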
Each tree is trained on a bootstrap sample with a random feature subset at each split. Both sources of randomness decorrelate the trees so the average cuts variance.
A single decision tree is high-variance — small changes to the training data produce very different trees. Random forests fight this by training many trees on slightly different views of the data, then averaging their predictions.
Two knobs do the work. Bootstrap sampling means each tree sees a different random sample (with replacement) of the training set. Feature subsetting means each split only considers a random subset of features (typically √p for classification, p/3 for regression). Together they ensure the trees disagree in independent ways.
Bonus: each bootstrap sample leaves out about 37% of the training points for that tree. Scoring each point using only the trees that never saw it gives an out-of-bag (OOB) estimate of generalization error, without a separate cross-validation run.
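The size of the out-of-bag share is easy to check empirically. The following sketch uses only the standard library; `out_of_bag_fraction` is a hypothetical helper written for this illustration, not part of any library.

```python
import random

def out_of_bag_fraction(n_points, seed=0):
    """Draw one bootstrap sample and return the share of points it never picked."""
    rng = random.Random(seed)
    # Sample n_points indices with replacement, as bagging does for each tree.
    drawn = {rng.randrange(n_points) for _ in range(n_points)}
    return 1 - len(drawn) / n_points

fractions = [out_of_bag_fraction(10_000, seed=s) for s in range(20)]
mean_oob = sum(fractions) / len(fractions)
print(f"Mean out-of-bag fraction: {mean_oob:.3f}")
```

The result should land close to 1/e ≈ 0.368, which is where the "roughly a third" rule of thumb comes from.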
Reach for it when
Tabular data with mixed feature types
You need a strong baseline with almost no tuning
Robustness matters more than the last 2% of accuracy
You want a free OOB error estimate
Skip it when
Extrapolation outside the training range is required
Monotonicity constraints are required
You need per-prediction interpretability
The tabular data has strong signal and you can afford tuning (gradient boosting usually wins there)
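The extrapolation caveat in the list above can be seen on a toy problem. This sketch assumes scikit-learn and NumPy are available; the data (y = 2x on the interval [0, 10]) is invented for the illustration. Because every leaf predicts an average of training targets, the forest cannot output a value above the largest target it saw.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data (an assumption for this sketch): y = 2x on the interval [0, 10].
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * X_train.ravel()

reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Query far outside the training range; the true value would be 40.
pred_far = reg.predict(np.array([[20.0]]))[0]
print(f"Prediction at x=20: {pred_far:.2f}, largest training target: {y_train.max():.2f}")
```

A linear model would recover the trend here, which is one reason to reach for something else when the inputs at prediction time drift beyond the training range.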
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # √p features considered per split
    oob_score=True,       # free generalization estimate
    n_jobs=-1,
    random_state=0,
)
clf.fit(X_train, y_train)
print(f"OOB accuracy: {clf.oob_score_:.3f}")
# Top five impurity-based importances (X_train is assumed to be a pandas DataFrame)
for name, imp in sorted(
    zip(X_train.columns, clf.feature_importances_),
    key=lambda x: -x[1],
)[:5]:
    print(f" {name:20s} {imp:.3f}")
$$ \mathrm{Var}\big(\hat{f}(x)\big) \;=\; \rho\,\sigma^2 \;+\; \frac{1-\rho}{B}\,\sigma^2 $$
In words. The variance of the averaged prediction splits into two pieces. The first is the correlation (called ρ, rho, a number between 0 and 1 measuring how similarly trees vote on the same input) times the variance σ² of a single tree. The second piece has the number of trees B in the denominator, so adding more trees only shrinks the second piece, while the first piece is a hard floor you can't escape no matter how many trees you grow. The whole point of feature subsetting is to push ρ closer to zero so that the floor is lower.
forest variance: spread of the ensemble's prediction at a fixed input across re-fits
tree variance: spread of a single tree's prediction at the same input
correlation: how similarly any two trees vote on the same input (lower is better)
number of trees: more trees only shrink the second term; the first sets the floor
As B → ∞ the second term vanishes — variance is floored at ρσ². The whole game is to drive ρ down via per-split feature subsets without making each tree too weak (which inflates σ²).
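The floor at ρσ² can be checked numerically. The sketch below does not train any trees. It assumes an idealized model in which every "tree output" is a Gaussian with variance σ² and every pair has the same correlation ρ, built from one shared component plus one private component per tree. It then compares the simulated variance of the average with the standard expression ρσ² + (1 − ρ)σ²/B. The function name and parameters are hypothetical.

```python
import random
import statistics

def simulated_forest_variance(n_trees, rho, sigma=1.0, n_trials=20000, seed=0):
    """Empirical variance of the mean of n_trees equally correlated 'tree outputs'."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_trials):
        shared = rng.gauss(0, 1)  # component common to every tree
        total = 0.0
        for _ in range(n_trees):
            own = rng.gauss(0, 1)  # component unique to this tree
            total += sigma * (rho ** 0.5 * shared + (1 - rho) ** 0.5 * own)
        means.append(total / n_trees)
    return statistics.pvariance(means)

rho, sigma, n_trees = 0.3, 1.0, 25
empirical = simulated_forest_variance(n_trees, rho, sigma)
predicted = rho * sigma**2 + (1 - rho) / n_trees * sigma**2
print(f"empirical: {empirical:.3f}   formula: {predicted:.3f}")
```

Raising `n_trees` moves both numbers toward ρσ² = 0.3 but not below it; only a smaller `rho` lowers the floor.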
Random forests are bagging applied to trees, with one twist: at each split, only a random subset of m features (out of p) is considered. This injects an extra source of decorrelation beyond bootstrap sampling alone, addressing the fact that standard bagging produces highly correlated trees whenever a few features dominate the splits.
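The max_features hyperparameter trades off correlation (low m → low ρ) against individual tree strength (low m → higher σ²). The commonly recommended starting points, √p for classification and p/3 for regression, hold up well across many problems. Note that scikit-learn's RandomForestRegressor considers all features by default, so p/3 has to be set explicitly there.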
Out-of-bag error. For a training set of N points, each bootstrap sample omits a fraction $(1 - 1/N)^N \to e^{-1} \approx 36.8\%$ of them. Averaging predictions from trees that excluded each point gives an estimate of generalization error that is asymptotically equivalent to leave-one-out CV, at no extra training cost.
Feature importance. Impurity-based importance (feature_importances_) is biased toward high-cardinality features. Prefer permutation_importance on a held-out set for honest estimates.
Reach for it when
You want a robust baseline before investing in boosting
You need uncertainty estimates (via per-tree predictions)
OOB is desirable (small datasets, expensive CV)
The signal is heterogeneous — feature interactions vary across regimes
Skip it when
You need calibrated probabilities without post-hoc calibration
Memory is tight (forests are heavyweight at inference)
Smooth function approximation matters (forests are piecewise-constant)
The variable of interest is monotone in inputs and you need that respected
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
clf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",
    min_samples_leaf=1,
    oob_score=True,
    bootstrap=True,
    n_jobs=-1,
    random_state=0,
).fit(X_train, y_train)
# Honest importance via permutation on held-out set
perm = permutation_importance(
    clf, X_test, y_test, n_repeats=10, n_jobs=-1, random_state=0
)
ranked = sorted(zip(X_train.columns, perm.importances_mean), key=lambda x: -x[1])
print(f"OOB: {clf.oob_score_:.3f} Test: {clf.score(X_test, y_test):.3f}")
for name, imp in ranked[:10]:
    print(f" {name:20s} {imp:+.4f}")