Comparing models in production — sample size, statistical significance, bandits, and the human factors.
Key idea
You can't tell from offline metrics alone whether a model is "better". Offline AUC up doesn't always mean online conversion up. A/B test in production with a clearly-defined metric, enough sample size for the effect you care about, and pre-registered analysis. Then decide.
The basic A/B. Random 50/50 split of users (or sessions). Group A (control) gets the current model; Group B gets the candidate. Compare a primary metric over a fixed evaluation window. Hypothesis test: is the difference larger than chance?
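Sketch. For a conversion-style metric, the hypothesis test above can be run as a two-proportion z-test. This is a minimal version using only the standard library; the function name and the counts in the usage line are illustrative assumptions, not figures from a real experiment.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


# Illustrative counts: 10.0% vs 11.0% conversion with 10 000 users per arm
diff, z, p_value = two_proportion_ztest(conv_a=1000, n_a=10_000, conv_b=1100, n_b=10_000)
```

The p-value is only interpretable at the pre-planned sample size; checking it repeatedly while data arrives needs a sequential method.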
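Sample size. Smaller effect → larger sample needed. The standard rule of thumb (5% two-sided significance, 80% power): n ≈ 16 / d² users per arm, where d is the effect size in standard-deviation units. A 1% relative improvement on a 10%-baseline metric (0.100 → 0.101, so σ ≈ 0.3 and d ≈ 0.0033) needs roughly 1.4 million users per arm; a 1-percentage-point absolute lift (0.10 → 0.11) needs about 14 000.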
Common gotchas. Peeking at results and stopping early (inflates false-positive rate). Network effects (the treatment of one user affects another). Novelty effects (users react to anything new, then revert). Seasonality (test on a representative time window).
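The sample-size formula, reconstructed here from the term definitions that follow. It assumes two equal-sized arms with the same variance and a two-sided test; with α = 5% and power 80% the constant 2 × (1.96 + 0.84)² ≈ 16 recovers the rule of thumb n ≈ 16 / d².

$$ \text{users per arm} \;=\; \frac{2 \times (\text{metric spread})^2 \times (\text{sum of z-values})^2}{(\text{minimum effect})^2} $$

$$ n \;=\; \frac{2\,\sigma^2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\delta^2} $$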
In words. This is the standard sample-size formula: how many users you need in each arm to reliably detect an effect. The numerator grows with the variance of your metric (σ²): noisier metrics need more data. The z-values come from your tolerated false-positive rate α (typically 5%) and false-negative rate β (typically 20%); together they encode "how confident do I want to be?". The denominator is the squared minimum detectable effect δ: the smaller the lift you want to catch, the more users you need. Halving the effect quadruples the sample.
users per arm: how many users you need in each of A and B
metric spread: standard deviation of the outcome metric (σ)
z-values: quantiles of the standard normal distribution set by the significance level α and the power 1−β
minimum effect: smallest lift δ you care about detecting
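Sketch. The terms above plug into a few lines of Python. This assumes two equal arms, a two-sided test, and a binary metric whose standard deviation is sqrt(p(1 − p)); the function name `users_per_arm` is illustrative.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """n = 2 * sigma^2 * (z_{1-alpha/2} + z_{power})^2 / delta^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 5%
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2)


baseline = 0.10
sigma = (baseline * (1 - baseline)) ** 0.5  # 0.3 for a 10% conversion rate

# 1% relative lift = 0.001 absolute; 1 percentage point = 0.01 absolute
n_relative = users_per_arm(sigma, delta=0.01 * baseline)
n_absolute = users_per_arm(sigma, delta=0.01)
```

With these inputs the relative-lift case comes out at roughly 1.4 million users per arm and the absolute one-point case at roughly 14 thousand, which shows how quickly the requirement grows as δ shrinks.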
Sequential testing. Fixed-horizon A/B requires committing to a sample size up front. Sequential / always-valid p-values let you peek without inflating false positives. Methods: SPRT, mSPRT, group-sequential designs, e-values. Tools: Optimizely, Statsig, Eppo all implement some variant.
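Sketch. To see what sequential methods are correcting for, the simulation below runs A/A tests (no true effect) and compares "stop at the first significant look" with "test once at the end". It is a toy setup assumed for illustration: unit-variance normal metrics, 20 evenly spaced looks, and a naive 1.96 threshold at every look. It does not implement SPRT or any always-valid method.

```python
import numpy as np


def false_positive_rates(n_sims=2000, n_per_arm=1000, n_looks=20, z_crit=1.96, seed=0):
    """Simulate A/A tests and return (peeking rate, fixed-horizon rate)."""
    rng = np.random.default_rng(seed)
    step = n_per_arm // n_looks
    checkpoints = step * np.arange(1, n_looks + 1)  # sample sizes at each look
    peek_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        a = rng.normal(size=n_per_arm)
        b = rng.normal(size=n_per_arm)  # same distribution as A: no real effect
        cum_diff = np.cumsum(b - a)
        # z at each look: mean difference divided by sqrt(2 / n) (known unit variance)
        z = cum_diff[checkpoints - 1] / np.sqrt(2 * checkpoints)
        peek_hits += int(np.any(np.abs(z) > z_crit))  # declared a win at any look
        fixed_hits += int(abs(z[-1]) > z_crit)        # declared a win at the end only
    return peek_hits / n_sims, fixed_hits / n_sims


peeking_rate, fixed_rate = false_positive_rates()
```

The fixed-horizon rate should land near the nominal 5%, while the peeking rate comes out several times higher, even though nothing differs between the arms.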
CUPED (variance reduction). Microsoft's technique: regress the metric on a pre-experiment covariate (the same user's metric before the test). The residuals have much lower variance, so you need fewer users for the same power. 20–50% sample size reduction is typical.
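Sketch. A minimal version of the adjustment, assuming a single pre-experiment covariate per user. The synthetic data and the name `cuped_adjust` are illustrative; in a real test θ would usually be estimated on the pooled data from both arms and the adjusted metric compared between arms as usual.

```python
import numpy as np


def cuped_adjust(y, x):
    """Return y minus the part linearly predicted by the pre-period covariate x."""
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # regression slope of y on x
    return y - theta * (x - x.mean())               # centring x keeps the mean of y


rng = np.random.default_rng(0)
pre = rng.normal(10, 3, size=5000)                # metric before the experiment
post = 0.8 * pre + rng.normal(0, 2, size=5000)    # metric during the experiment
adjusted = cuped_adjust(post, pre)

variance_reduction = 1 - adjusted.var() / post.var()
```

The reduction equals the squared correlation between the pre-period and in-experiment metric, so the gain depends entirely on how predictive the covariate is.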
Multi-armed bandits. Instead of fixed splits, allocate more traffic to better-performing arms over time. Thompson sampling is the popular choice. Trade-off: faster learning vs cleaner causal inference. Use bandits for exploitation; A/B for explanation.
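Sketch. Thompson sampling for a binary reward, assuming a Beta(1, 1) prior per arm and simulated users. The conversion rates passed in are made-up values for illustration.

```python
import numpy as np


def thompson_sampling(true_rates, n_rounds=5000, seed=0):
    """Beta-Bernoulli Thompson sampling; returns how often each arm was served."""
    rng = np.random.default_rng(seed)
    k = len(true_rates)
    successes = np.ones(k)  # Beta prior parameter alpha = 1
    failures = np.ones(k)   # Beta prior parameter beta = 1
    pulls = np.zeros(k, dtype=int)
    for _ in range(n_rounds):
        # Draw one plausible conversion rate per arm from its posterior
        samples = rng.beta(successes, failures)
        arm = int(np.argmax(samples))
        reward = rng.random() < true_rates[arm]  # simulated user response
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


pulls = thompson_sampling([0.05, 0.15])
```

Traffic drifts towards the better arm as evidence accumulates. The unequal, data-dependent allocation is what makes a clean effect estimate harder afterwards.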
Bayesian A/B. Compute the posterior probability that B beats A by at least δ. More intuitive for stakeholders ("80% chance B is better"). Same data; different interpretation. Doesn't fix multiple-comparisons or peeking by itself.
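Sketch. For a conversion metric, the posterior probability can be approximated by sampling from Beta posteriors. This assumes independent Beta(1, 1) priors on each arm's rate; the function name and the counts in the usage lines are illustrative.

```python
import numpy as np


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, min_lift=0.0, n_draws=200_000, seed=0):
    """Monte Carlo estimate of P(rate_B - rate_A > min_lift | data)."""
    rng = np.random.default_rng(seed)
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=n_draws)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=n_draws)
    return float(np.mean(post_b - post_a > min_lift))


# Illustrative counts: 10.0% vs 11.0% conversion with 10 000 users per arm
p_any_lift = prob_b_beats_a(1000, 10_000, 1100, 10_000)
p_one_point = prob_b_beats_a(1000, 10_000, 1100, 10_000, min_lift=0.01)
```

The second call shows why δ matters: B is very likely better than A, but it is close to a coin flip whether it is better by a full percentage point.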
Multiple comparisons. Testing many metrics increases the chance of a false positive somewhere. Pre-register a primary metric; report others as "exploratory". Bonferroni or Benjamini-Hochberg for principled correction.
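Sketch. Both corrections fit in a few lines of NumPy. The p-values in the usage lines are invented for illustration.

```python
import numpy as np


def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha divided by the number of tests."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / len(p)


def benjamini_hochberg(p_values, fdr=0.05):
    """Step-up procedure controlling the false discovery rate at `fdr`."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresholds = fdr * np.arange(1, m + 1) / m  # rank i gets threshold i * fdr / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank that meets its threshold
        reject[order[: k + 1]] = True      # reject it and every smaller p-value
    return reject


p_values = [0.20, 0.001, 0.045, 0.012, 0.025]  # illustrative metrics
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
```

On these values Bonferroni keeps only the smallest p-value, while Benjamini-Hochberg keeps three. That is the usual trade-off: stricter family-wise control versus more discoveries at a tolerated false discovery rate.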
Subgroup analysis. The treatment can help one segment and hurt another. Slice by demographics, geography, device. Watch for Simpson's paradox: aggregate looks fine, every subgroup is worse.
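Sketch. A small numeric illustration of the paradox, using illustrative (conversions, users) counts that follow the pattern of the classic textbook example. The segment names are assumptions.

```python
# (conversions, users) per segment and arm; counts are illustrative
counts = {
    "desktop": {"A": (81, 87), "B": (234, 270)},
    "mobile": {"A": (192, 263), "B": (55, 80)},
}

# Conversion rate within each segment
segment_rates = {
    segment: {arm: conv / n for arm, (conv, n) in arms.items()}
    for segment, arms in counts.items()
}

# Conversion rate after pooling the segments together
totals = {}
for arm in ("A", "B"):
    conversions = sum(counts[segment][arm][0] for segment in counts)
    users = sum(counts[segment][arm][1] for segment in counts)
    totals[arm] = conversions / users

b_worse_everywhere = all(rates["B"] < rates["A"] for rates in segment_rates.values())
b_better_overall = totals["B"] > totals["A"]
```

B loses in both segments yet wins in aggregate, because most of B's users sit in the high-converting segment. The reversal needs the A/B split to differ across segments, so in a randomized test it is also a hint to check for a sample-ratio mismatch.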
Estimate the value of a new policy π from logs of an old behaviour policy
Doesn't require deploying π to evaluate it
Variance grows with how different π is from the behaviour policy
$$ \text{estimated value} \;=\; \text{average over logs of}\; \frac{\text{1 if new policy agrees with log}}{\text{prob old policy picked that action}} \times \text{reward} $$
In words. You want to estimate how a new policy π (a way of choosing actions) would perform, using only logged data from the old "behaviour" policy. For each logged event, check whether the new policy would have chosen the same action; if yes, keep the reward but reweight it by dividing by the probability the old policy assigned to that action. The 𝟙{·} (indicator) is 1 when the condition holds, 0 otherwise. Dividing by the logging probability corrects for the bias from the old policy picking some actions more than others — actions that were rare under the old policy get up-weighted because we have less data about them. Average across all n logged events to get the estimate.
estimated value: expected reward of the new policy, estimated from old logs
new policy: the policy π you want to evaluate (without deploying it)
prob old policy picked that action: the propensity score from the logging policy
reward: observed outcome (click, conversion, revenue) for that logged event
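In symbols, one common way to write the same estimator. The notation is assumed here rather than taken from the text: x_i is the logged context, a_i the logged action, r_i the reward, and μ the logging (behaviour) policy.

$$ \hat{V}(\pi) \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{\mathbb{1}\{\pi(x_i) = a_i\}}{\mu(a_i \mid x_i)} \, r_i $$

For a stochastic new policy, the indicator is replaced by the probability π(a_i | x_i) that the new policy would have taken the logged action.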
Interleaving. Show both models' results to the same user (e.g., ranked-list problems — interleave A's and B's recommendations). Much higher statistical power per user; works because each user is their own control. Used heavily at Microsoft, Netflix, search engines.
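Sketch. One common interleaving scheme is team-draft interleaving: the two rankers take turns picking their best not-yet-shown item, and each click is credited to the ranker that contributed the item. The version below is a simplified illustration; the function names and document IDs are assumptions, and production systems add details such as deduplication rules and click weighting.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, seed=None):
    """Merge two rankings; returns (interleaved list, item -> contributing team)."""
    rng = random.Random(seed)
    rankings = {"A": ranking_a, "B": ranking_b}
    count = {"A": 0, "B": 0}
    interleaved, team = [], {}
    while len(interleaved) < length:
        # The team with fewer picks goes next; ties are broken by a coin flip
        if count["A"] != count["B"]:
            picker = "A" if count["A"] < count["B"] else "B"
        else:
            picker = rng.choice(["A", "B"])
        candidates = [item for item in rankings[picker] if item not in team]
        if not candidates:
            # This ranker has nothing new to offer; fall back to the other one
            picker = "B" if picker == "A" else "A"
            candidates = [item for item in rankings[picker] if item not in team]
            if not candidates:
                break
        item = candidates[0]
        interleaved.append(item)
        team[item] = picker
        count[picker] += 1
    return interleaved, team


def credit_clicks(team, clicked_items):
    """Count clicks per contributing ranker."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team:
            wins[team[item]] += 1
    return wins


shown, team = team_draft_interleave(["d1", "d2", "d3", "d4"], ["d3", "d1", "d5", "d6"], length=4, seed=1)
wins = credit_clicks(team, clicked_items=[shown[0]])
```

Aggregated over many impressions, the ranker with more credited clicks is preferred; a sign test or binomial test on per-impression wins is a typical way to assess significance.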
Off-policy evaluation (OPE). Estimate how a new policy would perform from logs of a previous one. IPS (inverse propensity scoring), doubly robust estimators, model-based OPE. Standard in recommendation and ad-ranking. Requires the logging policy to have explored — if it always picked the same thing, you can't evaluate alternatives.
Long-term effects. Some changes hurt short-term metrics but help long-term (paywalls, ads, content moderation). Different evaluation: long-term holdouts, instrumental variables, or careful causal modelling. Hard; rarely done well.
Heterogeneous treatment effects (HTE). The treatment helps some users and hurts others. Estimating τ(x) = E[Y(1) − Y(0) | X = x] with causal forests, double ML, or T-/X-learners. Useful for targeted deployment.
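Sketch. The simplest of the listed estimators is the T-learner: fit one outcome model per arm and take the difference of their predictions. The version below assumes a single feature and linear outcome models fitted by least squares; the synthetic data, with a true effect of τ(x) = 0.3x, is invented for illustration.

```python
import numpy as np


def fit_linear(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns [b0, b1]."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def t_learner_cate(x, treated, y, x_new):
    """Estimate tau(x) as the treated-model prediction minus the control-model prediction."""
    coef_control = fit_linear(x[~treated], y[~treated])
    coef_treated = fit_linear(x[treated], y[treated])
    design_new = np.column_stack([np.ones(len(x_new)), x_new])
    return design_new @ coef_treated - design_new @ coef_control


rng = np.random.default_rng(0)
n = 4000
x = rng.uniform(-1, 1, size=n)
treated = rng.random(n) < 0.5  # randomized assignment
y = 1.0 + 0.5 * x + treated * (0.3 * x) + rng.normal(0, 0.1, size=n)

tau_hat = t_learner_cate(x, treated, y, np.array([-1.0, 0.0, 1.0]))
```

Here the average effect is close to zero, yet users with x near 1 benefit and users with x near −1 are harmed, which is the case where targeted deployment pays off.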
Switchback experiments. When users can't be split (e.g., ride-share dispatching), switch the treatment on and off over time within the same population. Mitigates network effects.
Pre-registration & audit. Write the analysis plan before looking at results. Commit to one primary metric, one stopping rule, one analysis method. Reduces hindsight-driven p-hacking — and makes the test reusable in retros.
Cost-aware testing. Each "challenger" model deployment incurs implementation, ramp, and rollback costs. Decision-theoretic framing: expected value of running the test > cost of running it? Often the test isn't worth running because the expected effect is too small.
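import numpy as np
import pandas as pd


# Inverse Propensity Scoring: estimate policy value from logged data
def ips_estimate(contexts, actions, rewards, propensities, new_policy):
    """
    contexts:     context (features) observed for each logged event
    actions:      array of taken actions
    rewards:      observed reward for each
    propensities: P(action | context) under the LOGGING policy
    new_policy:   function (action, context) returning P(action | context)
                  under the NEW policy
    """
    rewards = np.asarray(rewards, dtype=float)
    propensities = np.asarray(propensities, dtype=float)
    # Importance weight = new-policy probability / logging-policy probability
    weights = np.array([
        new_policy(a, ctx) / propensities[i]
        for i, (a, ctx) in enumerate(zip(actions, contexts))
    ])
    return (weights * rewards).mean()


# Switchback for network-effect mitigation
def switchback_test(treatment_schedule, observations):
    """Treatment switches on/off in blocks; compare blocks of A vs B.

    observations: one record per time block, each with a "metric" field
    treatment_schedule: the arm for each block, in the same order
    """
    df = pd.DataFrame(observations)
    df["assignment"] = treatment_schedule
    return df.groupby("assignment")["metric"].mean()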