Random Forests — Layerwise ML

Key idea

Ask a crowd of decision trees, then take the majority vote. Each tree alone is shaky and makes its own mistakes, but the trees tend to be wrong in different places, so when you combine their answers many of the errors cancel out and the agreement usually points to the right answer.

In the demo, each tree is trained on a different bootstrap sample of the data with a random feature subset at every split, so each one is brittle and wrong in its own way. The ensemble prediction is the average of their votes: the individual errors largely cancel while the agreement reinforces. As the number of trees N grows from 1 to 80, the decision boundary goes from jagged and overconfident to smooth and stable.

A decision tree is a flowchart of yes/no questions ("Is age > 30?" → "Is income > 50k?" → …) that ends in a prediction. It's intuitive, but it's also brittle: small changes in your training data can produce a wildly different tree.

A random forest builds many such trees, each on a slightly different sample of your data, and lets them vote. It's a bit like asking 100 doctors for a second opinion instead of trusting one — the consensus is more reliable than any individual.
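The "many second opinions" intuition can be checked with a small simulation. This is a minimal sketch under an idealized assumption: every voter is right 65% of the time and votes independently of the others. The function name `majority_vote_accuracy` and the 65% figure are illustrative choices, not part of any library. Real trees are correlated, so the gain in a real forest is smaller than this.

```python
import random


def majority_vote_accuracy(n_voters, p_correct, n_trials=5_000, seed=0):
    """Estimate how often a majority of independent voters is right."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Count how many voters happen to be right in this trial
        correct_votes = sum(rng.random() < p_correct for _ in range(n_voters))
        # The crowd is right when more than half of the voters are right
        wins += correct_votes > n_voters / 2
    return wins / n_trials


single = majority_vote_accuracy(1, 0.65)
crowd = majority_vote_accuracy(101, 0.65)
print(f"one voter: {single:.3f}   majority of 101 voters: {crowd:.3f}")
```

An odd number of voters is used so that ties cannot occur.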

Reach for it when

Your data is in rows and columns (a spreadsheet)
You want something that "just works" with minimal setup
You're not sure where to start
You want a reasonable accuracy estimate without extra effort

Skip it when

You need to explain why a specific prediction came out a certain way
Your data is images, text, or sequences (use neural networks)
You need predictions outside the range you trained on

from sklearn.ensemble import RandomForestClassifier

# Train: just give it labelled data
# (X_train, y_train, X_test, y_test are assumed to be an existing train/test split)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}")

Key idea

$$ \hat{y}(x) \;=\; \frac{1}{B} \sum_{b=1}^{B} T_b(x) $$

ŷ(x): the forest's prediction for input x
B: the number of trees (typically 100–500)
T_b(x): the prediction from the b-th tree

$$ \text{forest prediction} \;=\; \text{average of all tree predictions} $$

In words. Ask every tree what it thinks about input x, then take the average of their answers (for classification, that's a majority vote; for regression, a numeric mean). B is how many trees you have, usually a few hundred. Each individual tree T_b was trained on a bootstrap re-sample of the data while only considering a random subset of features at every split, so the trees are deliberately decorrelated. Averaging cancels much of their uncorrelated error while preserving the signal they all agree on.

forest prediction: the ensemble's final answer at input x
tree predictions: each individual tree's vote or value for x
B: number of trees in the forest (typically 100–500)
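The averaging formula can be checked directly against a fitted forest. The sketch below assumes scikit-learn's `RandomForestClassifier`, whose `predict_proba` is documented as the mean of the per-tree class probabilities (a soft vote rather than a hard majority count). The synthetic dataset from `make_classification` is only a stand-in for real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; any labelled tabular dataset would do
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Class probabilities from every individual tree: shape (B, n_samples, n_classes)
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Averaging over the tree axis is the (1/B) * sum of T_b(x) from the formula
manual_average = per_tree.mean(axis=0)

matches = np.allclose(manual_average, forest.predict_proba(X))
print("manual average equals forest.predict_proba:", matches)
```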

Each tree is trained on a bootstrap sample with a random feature subset at each split. Both sources of randomness decorrelate the trees so the average cuts variance.

A single decision tree is high-variance — small changes to the training data produce very different trees. Random forests fight this by training many trees on slightly different views of the data, then averaging their predictions.

Two knobs do the work. Bootstrap sampling means each tree sees a different random sample (with replacement) of the training set. Feature subsetting means each split only considers a random subset of the p available features (typically √p for classification, p/3 for regression). Together they push the trees toward making less correlated errors.

Bonus: each bootstrap leaves out roughly 37% (a bit more than ⅓) of the data per tree. Aggregating, for each point, only the trees that did not see it gives an out-of-bag (OOB) estimate of generalization error without a separate cross-validation run.
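The two sources of randomness are easy to see in isolation. The following is a minimal sketch of the sampling steps only (no tree is grown), using the standard library; the dataset size of 1000 rows and 16 features is a hypothetical choice for illustration.

```python
import math
import random

rng = random.Random(0)
n_rows, n_features = 1000, 16  # hypothetical dataset size

# Bootstrap: draw n_rows row indices with replacement for one tree
in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]

# Rows never drawn are "out of bag" for this tree
out_of_bag = set(range(n_rows)) - set(in_bag)
oob_fraction = len(out_of_bag) / n_rows

# Feature subsetting: candidate features for ONE split (sqrt(p) rule)
m = max(1, round(math.sqrt(n_features)))
split_candidates = rng.sample(range(n_features), m)

print(f"out-of-bag fraction for this tree: {oob_fraction:.3f}")
print(f"features considered at this split: {sorted(split_candidates)}")
```

A fresh feature subset is drawn at every split, while the bootstrap sample is drawn once per tree.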

Reach for it when

Tabular data with mixed feature types
You need a strong baseline with almost no tuning
Robustness matters more than the last 2% of accuracy
You want a free OOB error estimate

Skip it when

Extrapolation outside the training range is required
Monotonicity constraints are required
You need per-prediction interpretability
High-signal tabular data where the last bit of accuracy matters (gradient boosting usually wins)
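The extrapolation limit in the list above is worth seeing once. In this sketch a `RandomForestRegressor` is fit on the toy relationship y = 2x for x between 0 and 10 (an illustrative dataset, not from the original), then asked about x = 20. Because every tree predicts an average of training targets, the forest cannot return a value above the largest target it saw.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: y = 2x on the interval [0, 10]
X_line = np.linspace(0, 10, 200).reshape(-1, 1)
y_line = 2.0 * X_line.ravel()

reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_line, y_line)

inside = reg.predict([[5.0]])[0]    # within the training range
outside = reg.predict([[20.0]])[0]  # far outside the training range

print(f"x=5  -> {inside:.2f}  (true value 10)")
print(f"x=20 -> {outside:.2f}  (true value 40, largest training target {y_line.max():.0f})")
```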

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # √p features considered per split
    oob_score=True,        # free generalization estimate
    n_jobs=-1,
    random_state=0,
)
clf.fit(X_train, y_train)

print(f"OOB accuracy: {clf.oob_score_:.3f}")
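# Top 5 impurity-based importances
# (assumes X_train is a pandas DataFrame, since it needs .columns)
for name, imp in sorted(
    zip(X_train.columns, clf.feature_importances_),
    key=lambda x: -x[1],
)[:5]:
    print(f"  {name:20s} {imp:.3f}")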

Variance reduction

$$ \mathrm{Var}(\hat{y}) \;=\; \rho\,\sigma^2 \;+\; \frac{1-\rho}{B}\,\sigma^2 $$

σ²: variance of an individual tree's prediction
ρ: pairwise correlation between trees on the same input
B: number of trees

$$ \text{forest variance} \;=\; \text{correlation} \times \text{tree variance} \;+\; \frac{1 - \text{correlation}}{\text{number of trees}} \times \text{tree variance} $$

In words. The variance of the averaged prediction splits into two pieces. The first is the correlation (called ρ, rho: a number between 0 and 1 measuring how similarly trees vote on the same input) times the variance of a single tree. The second piece has the number of trees in the denominator, so adding more trees only shrinks the second piece, while the first piece is a hard floor you can't escape no matter how many trees you grow. The whole point of feature subsetting is to push the correlation ρ closer to zero so that the floor is lower.

forest variance: spread of the ensemble's prediction at a fixed input across re-fits
tree variance: spread of a single tree's prediction at the same input
correlation: how similarly any two trees vote on the same input (lower is better)
number of trees: more trees only shrink the second term; the first sets the floor
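The formula can be recovered in a few lines, under the simplifying assumption that every tree has the same variance σ² at the input x and every pair of distinct trees has the same correlation ρ (so each pairwise covariance is ρσ²). Writing T_b for T_b(x):

$$ \mathrm{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right) \;=\; \frac{1}{B^2}\left[\sum_{b=1}^{B}\mathrm{Var}(T_b) \;+\; \sum_{b \neq b'}\mathrm{Cov}(T_b, T_{b'})\right] \;=\; \frac{1}{B^2}\left[B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\right] $$

$$ \;=\; \frac{\sigma^2}{B} + \frac{B-1}{B}\,\rho\,\sigma^2 \;=\; \rho\,\sigma^2 \;+\; \frac{1-\rho}{B}\,\sigma^2 $$

There are B variance terms and B(B−1) ordered pairs of distinct trees, which is where the two counts in the bracket come from.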

As B → ∞ the second term vanishes — variance is floored at ρσ². The whole game is to drive ρ down via per-split feature subsets without making each tree too weak (which inflates σ²).
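A quick simulation can confirm both the formula and the floor. This is a sketch under stated assumptions: the "trees" are stand-in Gaussian random variables built from one shared component and one private component each, so that every one has variance σ² and every pair has correlation ρ. The values σ² = 4, ρ = 0.3, and B = 10 are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, rho, B, n_trials = 4.0, 0.3, 10, 200_000

# One shared draw per trial (source of correlation) plus one private draw per tree
shared = rng.standard_normal((n_trials, 1))
private = rng.standard_normal((n_trials, B))

# Each column has variance sigma2; any two columns have correlation rho
trees = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * private)

empirical = trees.mean(axis=1).var()               # variance of the B-tree average
predicted = rho * sigma2 + (1 - rho) / B * sigma2  # the formula
floor = rho * sigma2                               # limit as B grows

print(f"empirical: {empirical:.3f}   formula: {predicted:.3f}   floor: {floor:.3f}")
```

Raising B in this sketch moves the empirical value toward the floor but never below it; only lowering ρ moves the floor itself.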

Random forests are bagging applied to trees, with one twist: at each split, only a random subset of m features (out of p) is considered. This injects an extra source of decorrelation beyond bootstrap sampling alone, addressing the fact that standard bagging produces highly correlated trees whenever a few features dominate the splits.

The max_features hyperparameter trades off correlation (low m → low ρ) against individual tree strength (low m → higher σ²). The classic empirical defaults, √p for classification and p/3 for regression, are reasonable starting points, though m is still worth tuning on your own data.

Out-of-bag error. With N training points, each bootstrap omits a fraction (1 − 1/N)^N → e⁻¹ ≈ 36.8% of them. For each point, averaging predictions from only the trees that excluded it gives an estimate of generalization error that closely tracks leave-one-out CV for large forests, at no extra training cost.
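The limit above converges quickly, which is why the 36.8% figure is a good approximation even for modest datasets. A small standard-library check, with the sample sizes chosen only for illustration:

```python
import math

limit = math.exp(-1)  # e^-1, the large-N limit
sizes = (10, 100, 1_000, 10_000)

# Probability that one given point is missed by all N draws of a bootstrap
left_out = [(1 - 1 / n) ** n for n in sizes]

for n, frac in zip(sizes, left_out):
    print(f"N={n:>6}  P(out of bag) = {frac:.4f}   gap to 1/e = {limit - frac:.5f}")
```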

Feature importance. Impurity-based importance (feature_importances_) is biased toward high-cardinality features and is computed on the training data, so it can reward features the trees merely overfit to. Prefer permutation_importance on a held-out set for honest estimates.
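The bias is easy to provoke on synthetic data. In this sketch (all data is made up for illustration) the label depends only on a binary feature, with 20% of labels flipped, while a second continuous feature is pure noise. Fully grown trees use the many possible split points of the noise feature to memorize the flipped labels, which inflates its impurity-based score; permutation importance on held-out rows should not show the same effect.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 2000
signal = rng.integers(0, 2, size=n)     # binary feature that drives the label
noise = rng.random(n)                   # continuous feature, unrelated to the label
flip = rng.random(n) < 0.2              # 20% label noise
y = np.where(flip, 1 - signal, signal)
X = np.column_stack([signal, noise])

# First half for fitting, second half held out
X_fit, y_fit = X[:1000], y[:1000]
X_held, y_held = X[1000:], y[1000:]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_fit, y_fit)

impurity = forest.feature_importances_
perm = permutation_importance(
    forest, X_held, y_held, n_repeats=10, random_state=0
).importances_mean

for i, name in enumerate(["signal (binary)", "noise (continuous)"]):
    print(f"{name:20s} impurity={impurity[i]:.3f}   permutation={perm[i]:+.3f}")
```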

Reach for it when

You want a robust baseline before investing in boosting
You need rough uncertainty estimates (via the spread of per-tree predictions)
OOB is desirable (small datasets, expensive CV)
The signal is heterogeneous — feature interactions vary across regimes

Skip it when

You need calibrated probabilities without post-hoc calibration
Memory is tight (forests are heavyweight at inference)
Smooth function approximation matters (forests are piecewise-constant)
The target is monotone in some inputs and the model must respect that

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

clf = RandomForestClassifier(
    n_estimators=500,
    max_features="sqrt",
    min_samples_leaf=1,
    oob_score=True,
    bootstrap=True,
    n_jobs=-1,
    random_state=0,
).fit(X_train, y_train)

# Honest importance via permutation on held-out set
perm = permutation_importance(
    clf, X_test, y_test, n_repeats=10, n_jobs=-1, random_state=0
)
ranked = sorted(zip(X_train.columns, perm.importances_mean), key=lambda x: -x[1])

print(f"OOB: {clf.oob_score_:.3f}   Test: {clf.score(X_test, y_test):.3f}")
for name, imp in ranked[:10]:
    print(f"  {name:20s} {imp:+.4f}")
