Image by Author
If you train models beyond a single notebook, you’ve probably hit the same headaches: you tweak five knobs, rerun training, and by Friday you can’t remember which run produced the “good” ROC curve or which data slice you used. Weights & Biases (W&B) gives you a paper trail — metrics, configs, plots, datasets, and models — so you can answer what changed with evidence, not guesswork.
Below is a practical tour. It’s opinionated, light on ceremony, and geared for teams who want a clean experiment history without building their own platform. Let’s call it a no-fluff walkthrough.
# Why W&B at All?
Notebooks grow into experiments. Experiments multiply. Soon you’re asking: Which run used that data slice? Why is today’s ROC curve higher? Can I reproduce last week’s baseline?
W&B gives you a place to:
- Log metrics, configs, plots, and system stats
- Version datasets and models with artifacts
- Run hyperparameter sweeps
- Share dashboards without screenshots
You can start tiny and layer features when needed.
# Setup in 60 Seconds
Start by installing the library and logging in with your API key. If you don’t have one yet, you can grab it from your W&B account settings.
pip install wandb
wandb login # paste your API key once

Image by Author
// Minimal Sanity Check
import wandb, random, time

wandb.init(project="kdn-crashcourse", name="hello-run", config={"lr": 0.001, "epochs": 5})

for epoch in range(wandb.config.epochs):
    loss = 1.0 / (epoch + 1) + random.random() * 0.05
    wandb.log({"epoch": epoch, "loss": loss})
    time.sleep(0.1)

wandb.finish()
Now you should see something like this:

Image by Author
Now let’s get to the useful bits.
# Tracking Experiments Properly
// Log Hyperparameters and Metrics
Treat wandb.config as the single source of truth for your experiment’s knobs. Give metrics clear names so charts auto-group.
cfg = dict(arch="resnet18", lr=3e-4, batch=64, seed=42)
run = wandb.init(project="kdn-mlops", config=cfg, tags=["baseline"])

# training loop ...
for step, (x, y) in enumerate(loader):
    # ... compute loss, acc
    wandb.log({"train/loss": loss.item(), "train/acc": acc, "step": step})

# log a final summary
run.summary["best_val_auc"] = best_auc
A few tips:
- Use namespaces like train/loss or val/auc to group charts automatically
- Add tags like "lr-finder" or "fp16" so you can filter runs later (see the sketch below)
- Use run.summary[...] for one-off results you want to see on the run card
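Tags pay off when you query runs back with the public API. A quick sketch, assuming a project under your own entity (your-entity is a placeholder):

import wandb

# Query every run in the project that carries the "baseline" tag
api = wandb.Api()
runs = api.runs("your-entity/kdn-mlops", filters={"tags": {"$in": ["baseline"]}})
for r in runs:
    print(r.name, r.tags)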
// Log Images, Confusion Matrices, and Custom Plots
wandb.log({
    "val/confusion": wandb.plot.confusion_matrix(
        preds=preds, y_true=y_true, class_names=classes)
})
You can also save any Matplotlib plot:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(history)
wandb.log({"training/curve": fig})
// Version Datasets and Models With Artifacts
Artifacts answer questions like, “Which exact files did this run use?” and “What did we train?” No more final_final_v3.parquet mysteries.
import wandb
run = wandb.init(project="kdn-mlops")
# Create a dataset artifact (run once per version)
raw = wandb.Artifact("imdb_reviews", type="dataset", description="raw dump v1")
raw.add_dir("data/raw") # or add_file("path")
run.log_artifact(raw)
# Later, consume the latest version
artifact = run.use_artifact("imdb_reviews:latest")
data_dir = artifact.download() # folder path pinned to a hash
Log your model the same way:
import torch
import wandb
run = wandb.init(project="kdn-mlops")
model_path = "models/resnet18.pt"
torch.save(model.state_dict(), model_path)
model_art = wandb.Artifact("sentiment-resnet18", type="model")
model_art.add_file(model_path)
run.log_artifact(model_art)
Now, the lineage is obvious: this model came from that data, under this code commit.
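That lineage comes from doing both in the same run: consume the dataset artifact as an input, log the model artifact as an output, and W&B connects them. A sketch reusing the names above:

import wandb

run = wandb.init(project="kdn-mlops", job_type="train")

# Input: pins this run to an exact dataset version
data_dir = run.use_artifact("imdb_reviews:latest").download()

# ... train on data_dir ...

# Output: W&B links this model back to the dataset version above
model_art = wandb.Artifact("sentiment-resnet18", type="model")
model_art.add_file("models/resnet18.pt")
run.log_artifact(model_art)
run.finish()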
// Tables for Evaluations and Error Analysis
wandb.Table
is a light dataframe for results, predictions, and slices.
table = wandb.Table(columns=["id", "text", "pred", "true", "prob"])
for r in batch_results:
    table.add_data(r.id, r.text, r.pred, r.true, r.prob)
wandb.log({"eval/preds": table})
Filter the table in the UI to find failure patterns (short reviews, rare classes, and so on).
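If the full prediction table gets large, one option is to log the failure slice on its own. A sketch reusing batch_results from above:

# Keep a separate, smaller table with only the misclassified rows
errors = wandb.Table(columns=["id", "text", "pred", "true", "prob"])
for r in batch_results:
    if r.pred != r.true:
        errors.add_data(r.id, r.text, r.pred, r.true, r.prob)
wandb.log({"eval/errors": errors})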
// Hyperparameter Sweeps
Define a search space in YAML, launch agents, and let W&B coordinate.
# sweep.yaml
method: bayes
metric: {name: val/auc, goal: maximize}
parameters:
  lr: {min: 0.00001, max: 0.01}
  batch: {values: [32, 64, 128]}
  dropout: {min: 0.0, max: 0.5}
Start the sweep:
wandb sweep sweep.yaml                       # returns a SWEEP_ID
wandb agent <entity>/<project>/<SWEEP_ID>    # run 1+ agents
Your training script should read wandb.config for lr, batch, etc. The dashboard shows top trials, parallel coordinates, and the best config.
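The training side can stay small. A sketch of what each agent runs, where build_model, train_one_epoch, and evaluate stand in for your own code:

# train.py — every sweep agent executes this with a different config
import wandb

def main():
    run = wandb.init(project="kdn-mlops")
    cfg = wandb.config  # lr, batch, dropout arrive from the sweep

    model = build_model(dropout=cfg.dropout)
    for epoch in range(10):
        loss = train_one_epoch(model, lr=cfg.lr, batch_size=cfg.batch)
        auc = evaluate(model)
        wandb.log({"train/loss": loss, "val/auc": auc})

if __name__ == "__main__":
    main()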
# Drop-In Integrations
Pick the one you use and keep moving.
// PyTorch Lightning
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

logger = WandbLogger(project="kdn-mlops")
trainer = pl.Trainer(logger=logger, max_epochs=10)
// Keras
import wandb
from wandb.keras import WandbCallback
wandb.init(project="kdn-mlops", config={"epochs": 10})
model.fit(X, y, epochs=wandb.config.epochs, callbacks=[WandbCallback()])
// Scikit-learn
from sklearn.metrics import roc_auc_score
wandb.init(project="kdn-mlops", config={"C": 1.0})
# ... fit model
wandb.log({"val/auc": roc_auc_score(y_true, y_prob)})
# Model Registry and Staging
Think of the registry as a named shelf for your best models. You push an artifact once, then manage aliases like staging or production so downstream code can pull the right one without guessing file paths.
run = wandb.init(project="kdn-mlops")
art = run.use_artifact("sentiment-resnet18:latest")

# Link the artifact into the registry under a registered model name,
# tagging this version as the staging candidate
run.link_artifact(art, "model-registry/sentiment-classifier", aliases=["staging"])
Flip the alias when you promote a new build. Consumers always read sentiment-classifier:production.
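On the consumer side, a serving or deployment job resolves the alias at run time. A sketch, assuming the registry lives under your entity (your-entity is a placeholder):

import wandb

run = wandb.init(project="kdn-mlops", job_type="deploy")

# Always fetch whatever version currently carries the "production" alias
art = run.use_artifact("your-entity/model-registry/sentiment-classifier:production", type="model")
model_dir = art.download()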
# Reproducibility Checklist
- Configs: Store every hyperparameter in wandb.config
- Code and commit: Use wandb.init(settings=wandb.Settings(code_dir=".")) to snapshot code, or rely on CI to attach the git SHA
- Environment: Log requirements.txt or the Docker tag and include it in an artifact
- Seeds: Log them and set them
Minimal seed helper:
def set_seeds(s=42):
    import random, numpy as np, torch
    random.seed(s)
    np.random.seed(s)
    torch.manual_seed(s)
    torch.cuda.manual_seed_all(s)
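Call it with the same seed you keep in the config, so the value you log is the value you actually used. For example:

cfg = dict(arch="resnet18", lr=3e-4, batch=64, seed=42)
run = wandb.init(project="kdn-mlops", config=cfg, tags=["baseline"])
set_seeds(wandb.config.seed)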
# Collaboration and Sharing Without Screenshots
Add notes and tags so teammates can search. Use Reports to stitch charts, tables, and commentary into a link you can drop in Slack or a PR. Stakeholders can follow along without opening a notebook.
# CI and Automation Tips
- Run wandb agent on training nodes to execute sweeps from CI
- Log a dataset artifact after your ETL job; train jobs can depend on that version explicitly
- After evaluation, promote model aliases (staging → production) in a small post-step
- Pass WANDB_API_KEY as a secret and group related runs with WANDB_RUN_GROUP
# Privacy and Reliability Tips
- Use private projects by default for teams
- Use offline mode for air-gapped runs (export WANDB_MODE=offline), train normally, then wandb sync later
- Don’t log raw PII. If needed, hash IDs before logging (see the sketch after this list).
- For large files, store them as artifacts instead of attaching them to wandb.log.
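For the PII point, hashing can be a one-liner with the standard library. A sketch, where user_id stands in for whatever raw identifier you have, dropped into the table-building loop from earlier:

import hashlib

def anon_id(user_id: str) -> str:
    # One-way hash keeps rows joinable without exposing the raw ID
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()[:16]

table.add_data(anon_id(r.id), r.text, r.pred, r.true, r.prob)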
# Common Snags (and Quick Fixes)
- “My run didn’t log anything.” The script may have crashed before wandb.finish() was called. Also, check that you haven’t set WANDB_DISABLED=true in your environment.
- Logging feels slow. Log scalars at each step, but save heavy assets like images or tables for the end of an epoch. You can also pass commit=False to wandb.log() and batch multiple logs together.
- Seeing duplicate runs in the UI? If you are restarting from a checkpoint, set id and resume="allow" in wandb.init() to continue the same run (see the snippet after this list).
- Experiencing mystery data drift? Put every dataset snapshot into an artifact and pin your runs to explicit versions.
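For the duplicate-runs case, resuming looks like this, where run_id is whatever stable ID you persist next to your checkpoint:

import wandb

# Reuse the same run ID when restarting from a checkpoint
run = wandb.init(project="kdn-mlops", id=run_id, resume="allow")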
# Pocket Cheatsheet
// 1. Start a Run
wandb.init(project="proj", config=cfg, tags=["baseline"])
// 2. Log Metrics, Images, or Tables
wandb.log({"train/loss": loss, "img": [wandb.Image(img)]})
// 3. Version a Dataset or Model
art = wandb.Artifact("name", type="dataset")
art.add_dir("path")
run.log_artifact(art)
// 4. Consume an Artifact
path = run.use_artifact("name:latest").download()
// 5. Run a Sweep
wandb sweep sweep.yaml && wandb agent <entity>/<project>/<SWEEP_ID>
# Wrapping Up
Start small: initialize a run, log a few metrics, and push your model file as an artifact. When that feels natural, add a sweep and a short report. You’ll end up with reproducible experiments, traceable data and models, and a dashboard that explains your work without a slideshow.
Josep Ferrer is an analytics engineer from Barcelona. He graduated in physics engineering and is currently working in the data science field applied to human mobility. He is a part-time content creator focused on data science and technology. Josep writes on all things AI, covering the applications of the ongoing explosion in the field.