Key idea

Every script you'll run more than twice deserves a CLI. A clean command line + a config file gives you reproducibility, schedulability, and shareability for free. Modern Python has good tools (Typer, Hydra) that make this nearly trivial.

Four reasonable choices. typer: type-hint-driven, modern, great defaults. click: classic, mature, used everywhere. argparse: stdlib; fine for tiny scripts. fire: zero-config; useful for prototypes. Skip hand-rolled sys.argv parsing; you'll regret it.

The pattern. One CLI per script. Each command takes a config (YAML / Hydra). Hyperparameters in the config; flags for things that vary per-run (output dir, debug mode). The CLI just dispatches; the work is in modules.

What goes where

  • CLI flags: run-specific (out dir, debug, dry-run)
  • Config file: hyperparameters, paths, model architecture
  • Env vars: secrets, API keys, runtime config
  • Code: never put paths or hyperparameters here

Common mistakes

  • 15 positional arguments — make them named
  • Hard-coded paths in the CLI defaults
  • One giant script with subcommands for unrelated things
  • No --help text — users (you, in 3 weeks) will hate you

from pathlib import Path

import typer
import yaml

app = typer.Typer(no_args_is_help=True)


def run_training(cfg: dict, out: Path, seed: int) -> None:
    """Placeholder: the real training code lives in a module, not in the CLI."""
    ...


@app.command()
def train(
    config: Path = typer.Argument(..., help="Path to YAML config"),
    out:    Path = typer.Option(Path("runs"), help="Output directory"),
    debug:  bool = typer.Option(False, help="Quick smoke run"),
    seed:   int  = typer.Option(0,    help="Random seed"),
):
    """Train a model from a YAML config."""
    cfg = yaml.safe_load(config.read_text()) or {}   # an empty file loads as None
    if debug:
        cfg["max_steps"] = 10                        # cap steps for a fast smoke run
    run_training(cfg, out=out, seed=seed)            # the CLI only dispatches

@app.command()
def evaluate(
    checkpoint: Path,
    test_data:  Path,
    threshold:  float = 0.5,
):
    """Evaluate a checkpoint on a held-out test set."""
    ...

if __name__ == "__main__":
    app()

Config + CLI interaction

$$ \text{values} \;=\; \text{defaults} \;\triangleleft\; \text{config file} \;\triangleleft\; \text{env vars} \;\triangleleft\; \text{CLI flags} $$

  • $\triangleleft$ = "overridden by"
  • Each layer wins over the previous
  • Standard precedence; matches what users expect

$$ \text{final values} \;=\; \text{defaults} \;\;\text{overridden by}\;\; \text{config} \;\;\text{overridden by}\;\; \text{env vars} \;\;\text{overridden by}\;\; \text{CLI flags} $$

In words. Configuration values come from multiple places, and you need a clear rule for who wins. The standard chain reads left to right, with each later source overriding any earlier one: baked-in defaults are the weakest; the config file overrides those; environment variables override the config; and CLI flags override everything. This matches the principle "the more local and explicit the source, the higher its priority". The symbol $\triangleleft$ in the math version is just shorthand for "overridden by".

  • defaults: fallback values written into the code
  • config file: YAML / TOML loaded at startup
  • env vars: shell environment variables (good for secrets and CI)
  • CLI flags: command-line arguments; highest priority
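
The precedence chain above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: DEFAULTS, the MYAPP_ prefix, and resolve() are hypothetical names, and the type coercion is deliberately naive.

```python
import os
from typing import Any, Mapping, Optional

# Hypothetical names for illustration: DEFAULTS, the MYAPP_ prefix, resolve().
DEFAULTS = {"lr": 1e-3, "batch_size": 32, "out": "runs"}


def resolve(
    file_cfg: Mapping[str, Any],
    cli_flags: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
    prefix: str = "MYAPP_",
) -> dict:
    """Merge the four layers; each later layer overrides the earlier ones."""
    env = os.environ if env is None else env
    values = dict(DEFAULTS)                      # 1. weakest: defaults in code
    values.update(file_cfg)                      # 2. config file
    for key, default in DEFAULTS.items():        # 3. env vars, e.g. MYAPP_LR
        raw = env.get(prefix + key.upper())
        if raw is not None:
            # Naive coercion to the default's type; fine for int/float/str only.
            values[key] = type(default)(raw)
    # 4. strongest: CLI flags, but only the ones the user actually passed
    values.update({k: v for k, v in cli_flags.items() if v is not None})
    return values


resolved = resolve(
    file_cfg={"lr": 0.01, "batch_size": 64},
    cli_flags={"batch_size": 128, "out": None},
    env={"MYAPP_LR": "0.1"},
)
print(resolved)
```

Here lr ends up coming from the environment, batch_size from the CLI, and out from the defaults, because the unset CLI flag (None) is not allowed to clobber a lower layer.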

Hydra. A config framework from Facebook Research (now Meta). YAML configs, overrides from the CLI (python train.py model.lr=1e-3), composable config groups via a defaults list (defaults: [model: resnet, data: cifar10]), and multi-runs / sweeps. Many ML training codebases converge on Hydra.

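Pydantic + CLI. Define configs as Pydantic models and you get validation, type coercion, and defaults. Typer does not parse Pydantic models directly, but the two combine easily: parse flags with Typer, then build and validate the model from them. Useful for strict schemas; pairs nicely with Pydantic-everywhere codebases.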

Subcommands. my-tool train ..., my-tool evaluate ..., my-tool deploy .... Typer's decorator pattern. Better than one huge script with a --mode flag.

Help is documentation. Every flag gets a one-line description. --help output is what you'll read in 3 weeks; make it good. Examples in the docstring are nicer than the user manual.
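
A small sketch of help text treated as documentation. It uses argparse from the standard library so it runs without extra dependencies (in Typer the same information comes from help= arguments and the docstring); the my-tool name, the flags, and the example invocation are illustrative assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="my-tool",
        description="Train a model from a YAML config.",
        # The epilog is a good home for a copy-pasteable example.
        epilog="example:\n  my-tool configs/train.yaml --out runs/exp1 --debug",
        formatter_class=argparse.RawDescriptionHelpFormatter,  # keep epilog line breaks
    )
    parser.add_argument("config", help="path to the YAML config")
    parser.add_argument("--out", default="runs",
                        help="output directory (default: %(default)s)")
    parser.add_argument("--debug", action="store_true",
                        help="quick smoke run with few steps")
    return parser


help_text = build_parser().format_help()
print(help_text)
```

Every flag carries a one-line description, and the default is shown in the help itself rather than left for the reader to find in the source.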

Dry-run flag. --dry-run prints what would happen without doing it. Useful for destructive operations (training that overwrites, deployments, data writes).
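
One way a --dry-run flag might be wired up, again with stdlib argparse for portability. The clean command, the *.ckpt pattern, and the my-tool-clean name are hypothetical; the point is that the dry-run path shares the target selection with the real path and differs only in the final action.

```python
import argparse
import tempfile
from pathlib import Path


def clean(run_dir: Path, dry_run: bool) -> list:
    """Delete *.ckpt files under run_dir; with dry_run, only report them."""
    targets = sorted(run_dir.glob("*.ckpt"))
    for path in targets:
        if dry_run:
            print(f"[dry-run] would delete {path}")
        else:
            path.unlink()
    return targets


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(prog="my-tool-clean")
    parser.add_argument("run_dir", type=Path, help="directory holding checkpoints")
    parser.add_argument("--dry-run", action="store_true",
                        help="print what would be deleted without deleting")
    args = parser.parse_args(argv)
    clean(args.run_dir, dry_run=args.dry_run)
    return 0


# Demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "step_10.ckpt"
    ckpt.touch()
    main([tmp, "--dry-run"])
    survived_dry_run = ckpt.exists()     # the dry run leaves the file alone
    main([tmp])
    deleted_for_real = not ckpt.exists()
```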

Config logging. The script writes the fully-resolved config to the run's output directory. Every flag override, every default, every env var — recorded. Reproducibility starts here.
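
A minimal sketch of config logging, assuming the resolved config is a plain dict. It writes JSON to avoid a third-party dependency (YAML would work the same way); save_resolved_config and the resolved_config.json filename are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_resolved_config(cfg: dict, out_dir: Path) -> Path:
    """Write the final config next to the run's other artifacts."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "resolved_config.json"
    # default=str makes non-JSON values such as Path objects serialisable
    path.write_text(json.dumps(cfg, indent=2, sort_keys=True, default=str))
    return path


with tempfile.TemporaryDirectory() as tmp:
    cfg = {"lr": 0.01, "seed": 0, "out": Path(tmp) / "exp1"}
    written = save_resolved_config(cfg, cfg["out"])
    reloaded = json.loads(written.read_text())
```

Calling this once, after all overrides are merged and before training starts, means a crashed run still leaves its configuration behind.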

import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(version_base=None, config_path="configs", config_name="train")
def main(cfg: DictConfig):
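    # Hydra creates a per-run output dir and saves the composed config to .hydra/config.yaml
    print(OmegaConf.to_yaml(cfg, resolve=True))   # print the config with interpolations resolved
    run_training(cfg)                             # your training entry point, defined elsewhere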

if __name__ == "__main__":
    main()

# Usage:
# python train.py                                 # defaults
# python train.py model.lr=1e-2 data=cifar100      # override individual values
# python train.py --multirun model.lr=1e-2,1e-3,1e-4  # sweep

The CLI contract

$$ \text{stdin} \to \text{exit code, stdout, stderr} $$

  • Exit code 0 on success, non-zero on failure (matters for CI / shells)
  • stdout for data / results; stderr for logs / progress
  • Pipeable, scriptable, testable

$$ \text{input stream} \;\to\; (\text{exit code},\; \text{output stream},\; \text{error stream}) $$

In words. The Unix contract every well-behaved CLI honours. You read from stdin (standard input) and produce three outputs: a numeric exit code (0 means success, anything else means failure — shells and CI key off this), stdout (standard output — where actual results go), and stderr (standard error — where logs, warnings, and progress bars go). Keeping logs out of stdout means your output can be piped into the next command without contamination. This separation is what makes CLI tools composable.

  • exit code: integer status; 0 = success, non-zero = failure
  • stdout: the program's "real" output; the pipe target
  • stderr: diagnostic messages, separate from stdout
  • stdin: input stream (file, pipe, keyboard)
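
The contract can be illustrated with a tiny filter that sums numbers. This is a sketch under assumptions: the main() signature with injectable streams is a testing convenience of this example, not a convention from any library, and a real script would end with sys.exit(main()).

```python
import io
import sys


def main(stdin=sys.stdin, stdout=sys.stdout, stderr=sys.stderr) -> int:
    """Sum one number per input line; results to stdout, diagnostics to stderr."""
    total = 0.0
    for lineno, line in enumerate(stdin, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            total += float(line)
        except ValueError:
            print(f"line {lineno}: not a number: {line!r}", file=stderr)
            return 1                      # non-zero exit code signals failure
    print(total, file=stdout)             # only the result goes to stdout
    return 0


# Successful run: result on stdout, nothing on stderr, exit code 0
ok_out, ok_err = io.StringIO(), io.StringIO()
ok_code = main(io.StringIO("1\n2.5\n"), ok_out, ok_err)

# Failing run: nothing on stdout, diagnostic on stderr, exit code 1
bad_out, bad_err = io.StringIO(), io.StringIO()
bad_code = main(io.StringIO("1\nabc\n"), bad_out, bad_err)
```

Because the error message never reaches stdout, a downstream command in a pipe sees either a clean result or nothing at all.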

Shell completion. Typer and Click can both generate completion scripts for bash / zsh / fish. Typer apps get an installer built in: my-tool --install-completion. Cheap UX win; pays dividends every day.

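Plugins. Click and Typer CLIs can both load plugins discovered through package entry points (importlib.metadata), attaching each discovered command or sub-app to the main app. Useful for very large CLIs (a "platform" CLI with sub-tools). Most ML projects don't need this, but it's there when you do.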

Long-running CLIs. A train command might run for days. Print structured logs to stderr, write metrics to a logger / file, write checkpoints. Support graceful shutdown on Ctrl+C — save state and exit cleanly.
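
Graceful shutdown might look like the sketch below. It relies on Python raising KeyboardInterrupt in the main thread on Ctrl+C; the train() loop, the JSON "checkpoint", and flaky_step (which simulates the interrupt so the example finishes on its own) are all stand-ins for real training code.

```python
import json
import tempfile
from pathlib import Path


def train(max_steps: int, ckpt_path: Path, step_fn) -> int:
    """Run step_fn repeatedly; on Ctrl+C, save progress and exit cleanly."""
    step = 0
    try:
        while step < max_steps:
            step_fn(step)
            step += 1
    except KeyboardInterrupt:
        # Ctrl+C (SIGINT) arrives here: persist state before exiting
        ckpt_path.write_text(json.dumps({"step": step}))
        return 130        # conventional shell exit code for "stopped by SIGINT"
    ckpt_path.write_text(json.dumps({"step": step}))
    return 0


def flaky_step(step: int) -> None:
    """Stand-in for one training step; simulates Ctrl+C arriving at step 3."""
    if step == 3:
        raise KeyboardInterrupt


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "last.json"
    exit_code = train(max_steps=10, ckpt_path=ckpt, step_fn=flaky_step)
    saved = json.loads(ckpt.read_text())
```

The saved step counts only completed steps, so a resumed run can restart from exactly that point.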

Daemon mode. Some CLIs spawn long-lived services (serving, monitoring). systemd unit files, Docker containers, or supervisord. The CLI itself should fork-and-detach cleanly or run in the foreground for the supervisor.

Testable CLIs. Click and Typer both ship a CliRunner (click.testing, typer.testing) for invoking commands in-process. Assert on the exit code and stdout, the same as testing any function.

Environment variable conventions. Pydantic Settings + dotenv for secrets. Prefix env vars (MYAPP_LOG_LEVEL) to avoid collision. Document them.
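
The prefix convention can be sketched with the standard library alone. This is a stand-in for what Pydantic Settings does with far more validation, not a replacement for it; load_env_settings and the sample variables are hypothetical.

```python
import os
from typing import Mapping, Optional


def load_env_settings(prefix: str = "MYAPP_",
                      env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect PREFIX_* variables into a dict with lower-cased, unprefixed keys."""
    env = os.environ if env is None else env
    return {
        name[len(prefix):].lower(): value
        for name, value in env.items()
        if name.startswith(prefix)
    }


settings = load_env_settings(env={
    "MYAPP_LOG_LEVEL": "debug",
    "MYAPP_API_KEY": "not-a-real-key",
    "PATH": "/usr/bin",            # unrelated variable, ignored
})
print(sorted(settings))
```

Only the prefixed variables are picked up, which is what keeps MYAPP_LOG_LEVEL from colliding with another tool's LOG_LEVEL.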

Versioning. Every CLI has --version. Helps debugging "which version is the CI runner using" mysteries.
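
With argparse, a --version flag is one line via the built-in "version" action. The version string below is a hypothetical placeholder; real projects often read it from package metadata instead. The redirect and try/except are only there so the example can show what the flag prints.

```python
import argparse
import contextlib
import io

__version__ = "0.1.0"   # hypothetical; often read from package metadata instead


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="my-tool")
    parser.add_argument("--version", action="version",
                        version=f"%(prog)s {__version__}")
    return parser


# argparse prints the version to stdout and then exits with status 0
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    try:
        build_parser().parse_args(["--version"])
    except SystemExit as exc:
        version_exit_code = exc.code
version_output = buf.getvalue().strip()
```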

import typer
from typer.testing import CliRunner

app = typer.Typer()

@app.command()
def add(a: int, b: int):
    """Add two numbers."""
    typer.echo(a + b)

# Test the CLI as a unit
def test_add():
    runner = CliRunner()
    result = runner.invoke(app, ["3", "4"])
    assert result.exit_code == 0
    assert result.stdout.strip() == "7"