Experiment Tracking — Layerwise ML

Key idea

If a run isn't logged, it didn't happen. Treat every training run as an experiment with a unique ID, all its hyperparameters captured, all its metrics streamed, all its artefacts saved. You'll thank yourself in three weeks when you're trying to find "the one where I tried dropout 0.3 with cosine schedule."

Without tracking: spreadsheets, dated folders, and the eternal regret of "which checkpoint was the good one?". With tracking: a searchable dashboard of every run, plottable side-by-side, with every config and artefact one click away.

What to log. Hyperparameters, the git commit, the dataset version, training loss and validation metrics (per step), system metrics (GPU utilisation), and final artefacts (checkpoints, predictions, plots). For deep learning, also log gradient norms and the learning rate.

Tools

W&B (wandb): hosted, free for individuals, nicest UI. Default in many shops.
MLflow: open-source, self-hosted. Solid for production-adjacent workflows.
Comet: similar to W&B, hosted
TensorBoard: supported out of the box in PyTorch via torch.utils.tensorboard, fine for one-off projects
Neptune, ClearML, Aim: niche but loyal fanbases

What to log

All hyperparameters, automatically (Hydra integration is nice)
Train loss + every val metric, per step
The git commit hash and any uncommitted diffs
The dataset version / hash
Final checkpoints and prediction artefacts
Learning rate, gradient norms, system utilisation

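import wandb
import torch

# model, scheduler, num_steps, train_step() and validate() are assumed
# to be defined elsewhere in the training script.
run = wandb.init(
    project="my-project",
    config={
        "lr":      1e-3,
        "batch":   64,
        "model":   "resnet18",
        "dataset": "cifar10",
    },
    tags=["baseline"],
)

for step in range(num_steps):
    loss = train_step()
    metrics = {"train/loss": loss.item(),
               "lr":         scheduler.get_last_lr()[0]}
    if step % 500 == 0:
        # Only add the validation metric on steps where it was computed,
        # rather than logging None on every other step.
        metrics["val/acc"] = validate()
    wandb.log(metrics, step=step)

# Save the final model and upload it with the run
torch.save(model.state_dict(), "final.pt")
wandb.save("final.pt")
run.finish()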

A run is a row, a metric is a column

$$ \text{Run}: (\text{config}, \text{git}, \text{data version}) \;\to\; (\text{metrics}_{t}, \text{artefacts}) $$

Every run uniquely identified
Configs flatten to columns; metrics are time series; artefacts are pointers
Compare across runs by querying the dashboard

$$ \text{Run} : (\text{config},\; \text{git sha},\; \text{data version}) \;\to\; (\text{metrics over time},\; \text{artefacts}) $$

In words. Think of every training run as a function from inputs to outputs. The inputs are everything that defines the run: the config (hyperparameters), the git commit (which code), and the dataset version. The outputs are the time-series of metrics (loss, accuracy, etc., indexed by step t) and the final artefacts (checkpoints, predictions, plots). The arrow → here means "produces". In tracking systems this gets flattened into a database row per run — inputs become searchable columns, metrics become charts you can overlay, artefacts become files keyed by run ID.

config: all hyperparameters and runtime settings
git sha: commit hash (ideally plus uncommitted diff) for the code
data version: dataset hash or pointer to a fixed dataset version
metrics over time: scalar series indexed by training step
artefacts: checkpoints, prediction dumps, evaluation reports
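
The "inputs become searchable columns" step can be sketched as a small flattening function. This is a minimal illustration, not how any particular tracker implements it; the name flatten_config, the dotted-key convention and the sample config are all assumptions made for the example.

```python
from typing import Any, Dict


def flatten_config(config: Dict[str, Any], prefix: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten a nested config dict into one level of dotted column names."""
    flat: Dict[str, Any] = {}
    for key, value in config.items():
        name = f"{prefix}{sep}{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into sub-configs, carrying the parent key as a prefix.
            flat.update(flatten_config(value, prefix=name, sep=sep))
        else:
            flat[name] = value
    return flat


# Hypothetical config, for illustration only.
row = flatten_config({"optim": {"lr": 1e-3, "schedule": "cosine"},
                      "model": {"dropout": 0.3}})
print(row)
# {'optim.lr': 0.001, 'optim.schedule': 'cosine', 'model.dropout': 0.3}
```

Each key of the resulting dict corresponds to one column of the run's row, so a query like "dropout 0.3 with cosine schedule" becomes a filter on model.dropout and optim.schedule.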

Logging discipline. Decide on a schema upfront and stick to it: train/loss, val/loss, val/acc, val/precision. Don't have loss in one run and training_loss in another — the dashboard can't compare them.
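
One way to enforce such a schema is to validate metric names before they reach the tracker. The sketch below is an assumption-laden example: the split/metric pattern, the EXTRA_KEYS allow-list and the function name check_metric_keys are hypothetical and not part of any tracking library.

```python
import re

# Assumed convention: "<split>/<metric>", e.g. "train/loss" or "val/acc".
KEY_PATTERN = re.compile(r"^(train|val|test)/[a-z0-9_]+$")
# Hypothetical allow-list for un-prefixed keys such as the learning rate.
EXTRA_KEYS = {"lr"}


def check_metric_keys(metrics: dict) -> dict:
    """Return metrics unchanged, or raise ValueError if a key breaks the schema."""
    bad = [k for k in metrics if k not in EXTRA_KEYS and not KEY_PATTERN.match(k)]
    if bad:
        raise ValueError(f"metric keys outside the schema: {bad}")
    return metrics


# Passes: both keys follow the convention.
check_metric_keys({"train/loss": 0.42, "lr": 1e-3})

# Rejected: "training_loss" would silently become a second, incomparable column.
try:
    check_metric_keys({"training_loss": 0.42})
except ValueError as err:
    print(err)
```

Wrapping every logging call in a check like this turns a naming slip into an immediate error instead of a dashboard that cannot overlay two runs.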

Logging frequency. Train loss per batch is too noisy and expensive; per N batches (50–500) is fine. Val metrics per epoch (or every K steps). Gradient norms periodically — they're often the first sign of trouble.
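
Logging every N batches can be sketched with a small accumulator that averages the loss over a window and only emits a value when the window is full. The class name WindowedMean and its interface are assumptions for this example; the returned value is what you would pass to your tracker's log call.

```python
from typing import Optional


class WindowedMean:
    """Accumulate a scalar every batch and emit its mean once every `every` batches."""

    def __init__(self, every: int = 100) -> None:
        if every < 1:
            raise ValueError("every must be at least 1")
        self.every = every
        self._total = 0.0
        self._count = 0

    def update(self, value: float) -> Optional[float]:
        """Add one value; return the window mean when the window is full, else None."""
        self._total += float(value)
        self._count += 1
        if self._count < self.every:
            return None
        mean = self._total / self._count
        # Reset so the next window starts from scratch.
        self._total, self._count = 0.0, 0
        return mean


meter = WindowedMean(every=3)
emitted = [meter.update(v) for v in (1.0, 2.0, 3.0, 4.0)]
print(emitted)
# [None, None, 2.0, None]
```

Averaging over the window, rather than sampling every Nth batch, keeps the logged curve smooth while still cutting the number of logging calls by a factor of N.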

Tagging and grouping. Add tags ("baseline", "ablation-dropout", "phase-2") to make runs filterable. Group runs by sweep ID so you can compare "all hyperparameter search trials from yesterday".

Versioning artefacts. Version models, prediction dumps, and evaluation reports, not just code. W&B's Artifacts and MLflow's Model Registry both handle this: promote a model to "staging" and then "production", with explicit tracking of which run produced each version.

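The git + diff trick. Save the commit hash AND the uncommitted diff (git diff HEAD). Now you can reproduce the code of any run exactly, even if it was launched from a dirty working tree. Note that git diff HEAD does not include untracked files, so new files must be staged (git add) or saved separately.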

Hydra + W&B. Hydra parses the YAML config into a DictConfig; convert it to a plain dict with OmegaConf.to_container and pass that to wandb.init(config=...). Every Hydra override then shows up automatically as a column in the dashboard.

import subprocess

import hydra
import wandb
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="configs", config_name="train")
def main(cfg: DictConfig):
    # Capture the git state for reproducibility
    git_sha  = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()
    git_diff = subprocess.check_output(["git", "diff", "HEAD"]).decode()

    run = wandb.init(
        project=cfg.project,
        # Convert OmegaConf objects to plain Python containers before logging
        config=OmegaConf.to_container(cfg, resolve=True),
        tags=list(cfg.get("tags", [])),
        notes=f"git: {git_sha[:8]}",
    )
    if git_diff:
        # Save the uncommitted diff with the run's files
        with open("uncommitted.diff", "w") as f:
            f.write(git_diff)
        wandb.save("uncommitted.diff")

    train(cfg)  # train() is assumed to be defined elsewhere
    run.finish()

if __name__ == "__main__":
    main()

A complete run record

$$ \text{record} = (\text{code}, \text{config}, \text{data}, \text{env}, \text{seeds}, \text{metrics}, \text{artefacts}) $$

Code: git commit + uncommitted diff
Env: lockfile or container hash
Data: dataset hash / version pointer
Seeds: random, numpy, torch, cuda

$$ \text{run record} \;=\; (\text{code},\; \text{config},\; \text{data},\; \text{environment},\; \text{seeds},\; \text{metrics},\; \text{artefacts}) $$

In words. A complete record of a single training run is a tuple of seven things — drop any one and you lose reproducibility. Code is the exact source (git sha + uncommitted diff). Config is all hyperparameters. Data is which dataset version was used. Environment captures library versions (lockfile) or the container hash. Seeds are the random seeds for Python, NumPy, PyTorch, and CUDA — without these, "same code, same data" still gives different results. Metrics are the time series. Artefacts are the binary outputs (checkpoints, plots).

code: git commit + any uncommitted diff
config: all hyperparameters and runtime settings
data: dataset version hash or pointer
environment: lockfile (uv.lock, poetry.lock) or container digest
seeds: random seeds for Python, NumPy, PyTorch, CUDA
metrics: scalar time series logged during the run
artefacts: saved checkpoints, prediction dumps, plots
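
Two of these fields, seeds and data, are easy to capture with a few lines of code. The sketch below is one possible approach under stated assumptions: seed_everything and file_sha256 are hypothetical helper names, NumPy and PyTorch are seeded only if they happen to be installed, and the dataset is assumed to be a single file that can be hashed.

```python
import hashlib
import random


def seed_everything(seed: int) -> None:
    """Seed Python, plus NumPy and PyTorch (CPU and CUDA) when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    except ImportError:
        pass


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a dataset file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Logging the seed and the hash into the run config (for example as seed and data_sha256, both hypothetical key names) makes them searchable like any other hyperparameter. Seeding is necessary but may not be sufficient for bit-identical results, since some GPU kernels are nondeterministic.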

Offline runs. When training on a cluster without internet, log to local files first (WANDB_MODE=offline or MLflow's local backend) and sync afterwards. On fully isolated networks, a self-hosted tracking server (W&B Server or an MLflow tracking server) plays the same role.

Distributed logging. Only rank 0 should log scalar metrics — otherwise every rank writes them and you get an N× inflated step counter. Rank 0 broadcasts the run ID; other ranks can save per-rank artefacts (gradient histograms) under separate keys.

Team workflows. Shared projects with role-based access (W&B Teams, MLflow's auth). Naming conventions (project_phase_owner). Tagging discipline (baseline, candidate, production). Reports/Notebooks for sharing findings.

Cost tracking. Modern dashboards can ingest GPU-hours and dollar cost per run. Useful for blameless retros and for spotting runs that consume 80% of the budget for 5% of the gain.
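
The arithmetic behind that kind of retro is simple enough to sketch. Everything here is hypothetical: the RunCost fields, the cost_report function, the GPU price and the sample runs are illustrative values, not data from any real project or tracker.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunCost:
    name: str
    gpu_hours: float
    metric_gain: float  # improvement over the baseline, in metric points


def cost_report(runs: List[RunCost], usd_per_gpu_hour: float) -> List[Dict]:
    """Per-run dollar cost, share of the total budget, and dollars per point gained."""
    total = sum(r.gpu_hours for r in runs) * usd_per_gpu_hour
    report = []
    for r in runs:
        cost = r.gpu_hours * usd_per_gpu_hour
        report.append({
            "name": r.name,
            "cost_usd": cost,
            "budget_share": cost / total if total else 0.0,
            # Runs with no gain get an infinite price per point.
            "usd_per_point": cost / r.metric_gain if r.metric_gain > 0 else float("inf"),
        })
    return report


# Made-up numbers: a cheap run with a large gain, an expensive run with a small one.
runs = [RunCost("small-sweep", gpu_hours=20, metric_gain=1.0),
        RunCost("big-model", gpu_hours=80, metric_gain=0.5)]
for line in cost_report(runs, usd_per_gpu_hour=2.0):
    print(line)
```

Sorting the report by usd_per_point puts the runs that bought the least improvement per dollar at the bottom, which is usually the starting point for the budget discussion.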

Integration with experiment platforms. Optuna trials can be streamed into W&B through Optuna's W&B callback. Lightning's Trainer logs to whichever logger you pass it (for example WandbLogger or MLFlowLogger). AzureML, Vertex, and SageMaker all bridge to common trackers.

The "experiment journal" pattern. Per project, maintain a short markdown "what I tried, what happened, why" file alongside the code. Tracking dashboards are noisy; a few hand-written sentences per branch is what you'll actually re-read.

import torch.distributed as dist
import wandb

def setup_logging(cfg):
    distributed = dist.is_available() and dist.is_initialized()
    is_rank0 = dist.get_rank() == 0 if distributed else True
    run_id = [None]
    if is_rank0:
        run = wandb.init(project=cfg.project, config=cfg)
        run_id[0] = run.id
    if distributed:
        # An environment variable set on rank 0 is not visible to the other
        # processes, so broadcast the run ID explicitly.
        dist.broadcast_object_list(run_id, src=0)
    return is_rank0, run_id[0]

def log_metrics(metrics, step, is_rank0):
    # Only rank 0 writes scalar metrics
    if not is_rank0:
        return
    wandb.log(metrics, step=step)

# Offline-then-sync workflow for clusters without internet
# Run: WANDB_MODE=offline python train.py
# Later: wandb sync wandb/offline-run-...
