Speed Guide

Opt-in flags that reduce training time: AMP precision, torch.compile, DataLoader workers, memory pinning, prefetch, and async checkpointing.


Overview

KynML is conservative by default: fp32, no compilation, num_workers=0. Every flag in this guide is an explicit opt-in. Enable them one at a time and measure after each change; a flag enabled on hardware that cannot use it adds overhead without any speedup.


Mixed-precision training (precision)

What it does

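precision = fp16 or precision = bf16 wraps the forward pass in torch.amp.autocast and uses GradScaler for gradient scaling. Parameters and optimizer state stay in fp32; autocast lowers the precision only of eligible ops such as matmuls and convolutions, and each backward op runs in the dtype autocast chose for the matching forward op. Net effect: roughly 1.5–2x throughput on Ampere+ GPUs due to tensor core utilisation.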

train:
    ...
    precision = fp16

or

train:
    ...
    precision = bf16

Generated code

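from torch.cuda.amp import GradScaler, autocast
# contextlib and torch are assumed to be imported in the script header

# in train_model():
_use_amp = torch.cuda.is_available()   # AMP is enabled only when CUDA is present
_scaler = GradScaler() if _use_amp else None

# per batch:
_amp_ctx = torch.amp.autocast(
    device_type=device.type, dtype=torch.float16   # torch.bfloat16 for precision = bf16
) if _use_amp else contextlib.nullcontext()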
with _amp_ctx:
    predictions = model(features)
    loss = criterion(predictions, target)
# scaler path:
if _scaler is not None:
    _scaler.scale(loss).backward()
    _scaler.step(optimizer)
    _scaler.update()
else:
    loss.backward()
    optimizer.step()

When to use

| Scenario | Recommendation |
|---|---|
| NVIDIA GPU (Ampere / Hopper) | fp16 or bf16; expect 1.5–2x speedup |
| NVIDIA Volta (V100) | fp16; tensor cores present but bf16 not supported |
| AMD GPU (ROCm) | fp16; check ROCm version, bf16 support varies |
| Apple Silicon (MPS) | Avoid. AMP is gated on cuda.is_available(), so it silently falls back to fp32. No error, no speedup. |
| CPU-only | Avoid. _use_amp = False, code runs fp32. Overhead: near zero, but no benefit. |

bf16 is more numerically stable than fp16: it keeps the 8-bit exponent of fp32, so gradients rarely overflow or underflow and a GradScaler is not needed in principle (KynML still uses one for safety). Prefer bf16 on Ampere+. Use fp16 if your GPU predates Ampere or if you have existing fp16 infrastructure.
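
The range difference between the two formats can be worked out from their bit layouts alone. The sketch below assumes the standard layouts (fp16: 5 exponent bits and 10 mantissa bits; bf16: 8 exponent bits and 7 mantissa bits); the helper name float_format_limits is hypothetical and the code does not depend on PyTorch.

```python
def float_format_limits(exponent_bits, mantissa_bits):
    """Return (largest finite value, smallest positive normal value)."""
    bias = 2 ** (exponent_bits - 1) - 1
    largest = (2 - 2 ** -mantissa_bits) * 2.0 ** bias
    smallest_normal = 2.0 ** (1 - bias)
    return largest, smallest_normal


fp16_max, fp16_min = float_format_limits(exponent_bits=5, mantissa_bits=10)
bf16_max, bf16_min = float_format_limits(exponent_bits=8, mantissa_bits=7)

print(f"fp16: max={fp16_max:.6g}, smallest normal={fp16_min:.3g}")
print(f"bf16: max={bf16_max:.6g}, smallest normal={bf16_min:.3g}")
```

fp16 tops out at 65504, so a large loss or gradient overflows quickly, which is the problem loss scaling works around. bf16 reaches roughly 3.4e38, the same order as fp32, at the cost of fewer mantissa bits.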

Default

precision = fp32 — no AMP, no imports, standard loss.backward() / optimizer.step().


torch.compile (compile = true)

What it does

Calls torch.compile(model) after instantiation. PyTorch 2.0+ traces the model and compiles the captured graph, by default with the TorchInductor backend. The first epoch is slow because compilation happens during the first forward passes; subsequent epochs are faster. Expect a 10–30% speedup on GPU; gains on CPU vary.

train:
    ...
    compile = true

Generated code

# in main():
model = HousePriceModel().to(device)
model = torch.compile(model)   # compile = true
# vs.
# compile flag is False; skipping

When to use

| Scenario | Recommendation |
|---|---|
| GPU training, many epochs | Yes; amortises compilation cost after epoch 1 |
| GPU training, few epochs (< 5) | Likely not worth it; compilation overhead dominates |
| CPU training | Possible but modest gains; adds ~30–60s startup time |
| Apple MPS | Not supported in most PyTorch builds |
| Debugging | Disable; compiled graphs suppress readable tracebacks |

torch.compile and precision = fp16/bf16 stack cleanly — use both for maximum GPU throughput.
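
Whether compilation pays off depends on how many epochs it takes to repay the one-off cost. The following back-of-the-envelope sketch makes that explicit; the function name and the timings (60 s compile, 30 s per uncompiled epoch, 1.2x speedup) are hypothetical, so substitute measurements from your own run.

```python
def compile_break_even_epochs(compile_seconds, epoch_seconds, speedup):
    """Epochs needed before the per-epoch saving repays the compile cost."""
    if speedup <= 1.0:
        return float("inf")  # no per-epoch saving, so compilation never pays off
    saved_per_epoch = epoch_seconds * (1 - 1 / speedup)
    return compile_seconds / saved_per_epoch


# Hypothetical numbers: 60 s compile, 30 s uncompiled epoch, 1.2x speedup.
break_even = compile_break_even_epochs(
    compile_seconds=60, epoch_seconds=30, speedup=1.2
)
print(f"break-even after about {break_even:.1f} epochs")
```

With these inputs each compiled epoch saves about 5 s, so the break-even point is about 12 epochs. Shorter epochs or a smaller speedup push that point further out.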


DataLoader workers (num_workers, pin_memory, prefetch)

These are dataset-block fields that control I/O parallelism.

num_workers

dataset HouseData:
    source = csv("data/housing.csv")
    target = "price"
    num_workers = 4

Sets DataLoader(num_workers=4). Worker processes prefill batches while the GPU trains. Default is 0 (main process only).

Generated code:

dataloader_extra = "num_workers=4, pin_memory=False"
DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4, pin_memory=False)

Guidelines:

  • Start with num_workers = 4 on a machine with 8+ CPU cores and a GPU.
  • On CPU-only machines, num_workers > 0 adds IPC overhead with no GPU to overlap — keep at 0.
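  • On macOS and Windows, DataLoader workers are started with spawn rather than fork, so the dataset and any collate function must be picklable and the script's entry point must sit under an if __name__ == "__main__": guard. On Linux, where fork has traditionally been the default, some libraries misbehave in forked workers; if you see crashes or hangs, pass multiprocessing_context="spawn" to the DataLoader in the generated script.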
  • Rule of thumb: num_workers = num_CPU_cores / 2, capped at 8.
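
The guidelines above can be condensed into a small helper. This is a sketch of the rule of thumb only; the function name suggest_num_workers and its parameters are hypothetical and not part of KynML.

```python
import os


def suggest_num_workers(cpu_cores=None, has_gpu=True, cap=8):
    """Apply the rule of thumb: half the CPU cores, capped, and 0 without a GPU."""
    if not has_gpu:
        return 0  # no accelerator to overlap with, so avoid IPC overhead
    if cpu_cores is None:
        cpu_cores = os.cpu_count() or 1  # os.cpu_count() can return None
    return min(cap, cpu_cores // 2)


print(suggest_num_workers(cpu_cores=8))                  # 8-core GPU machine
print(suggest_num_workers(cpu_cores=32))                 # capped at 8
print(suggest_num_workers(cpu_cores=16, has_gpu=False))  # CPU-only machine
```

Treat the result as a starting point and confirm it by timing an epoch, since the best value depends on how expensive each sample is to load.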

pin_memory

dataset HouseData:
    source = csv("data/housing.csv")
    target = "price"
    num_workers = 4
    pin_memory = true

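Sets DataLoader(pin_memory=True). Batches are copied into page-locked (pinned) host memory, which lets the GPU fetch them by DMA and makes asynchronous host-to-device transfers possible.

  • Only effective with a CUDA GPU. Has no effect and may add overhead on CPU-only or MPS.
  • Works with any num_workers value: batches are pinned in the main process (by a background thread when num_workers >= 1), not inside the workers. The gain is largest when workers keep batches ready and the device copy uses non_blocking=True, so that the transfer overlaps with compute.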

prefetch

dataset HouseData:
    source = csv("data/housing.csv")
    target = "price"
    num_workers = 4
    pin_memory = true
    prefetch = 2

Sets DataLoader(prefetch_factor=2). Each worker pre-loads 2 batches ahead. Reduces GPU idle time between batches.

Generated code:

DataLoader(train_dataset, batch_size=32, shuffle=True,
           num_workers=4, pin_memory=True, prefetch_factor=2)

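Requires num_workers >= 1; PyTorch raises a ValueError if prefetch_factor is set while num_workers is 0. Up to num_workers × prefetch batches are held in memory at once (8 with the settings above). Values of 2–4 are typical; larger values increase memory usage.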
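
The constraints on these three fields interact, so it can help to build the DataLoader arguments in one place. The following sketch shows one way to do that; dataloader_kwargs and its parameters are hypothetical names, and this is not necessarily how KynML assembles the call.

```python
def dataloader_kwargs(num_workers=0, pin_memory=False, prefetch=None,
                      cuda_available=True):
    """Build DataLoader keyword arguments that respect the constraints above."""
    kwargs = {
        "num_workers": num_workers,
        # Pinned memory only helps host-to-device copies on CUDA.
        "pin_memory": bool(pin_memory and cuda_available),
    }
    # PyTorch rejects prefetch_factor when num_workers == 0, so pass it only
    # when worker processes exist.
    if prefetch is not None and num_workers > 0:
        kwargs["prefetch_factor"] = prefetch
    return kwargs


gpu_kwargs = dataloader_kwargs(num_workers=4, pin_memory=True, prefetch=2)
cpu_kwargs = dataloader_kwargs(num_workers=0, pin_memory=True, prefetch=2,
                               cuda_available=False)
print(gpu_kwargs)
print(cpu_kwargs)
```

The result can be unpacked into the constructor, for example DataLoader(train_dataset, batch_size=32, shuffle=True, **gpu_kwargs).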

Typical settings on a machine with a CUDA GPU:

dataset TrainData:
    source = csv("data/train.csv")
    target = "label"
    num_workers = 4
    pin_memory = true
    prefetch = 2

Typical settings on a CPU-only machine:

dataset TrainData:
    source = csv("data/train.csv")
    target = "label"
    num_workers = 0   # or omit; 0 is the default
    pin_memory = false


Async checkpoint saving (async_save = true)

What it does

By default, checkpoints are saved synchronously: torch.save(state, path) blocks the training loop until the file is written. With async_save = true, the save is offloaded to a daemon thread so training continues immediately.

train:
    ...
    checkpoint = checkpoint(every_n=5, path="checkpoints/ckpt.pt", async_save=true)

Generated code

# async_save = false (default):
torch.save(_state, _ckpt_path)

# async_save = true:
_t = threading.Thread(target=torch.save, args=(_state, _ckpt_path), daemon=True)
_t.start()

The threading import is always present in the generated script header.

When to use

  • On large models (100M+ parameters) where torch.save takes > 1s per checkpoint.
  • When checkpointing every epoch and training on fast GPUs where the save becomes a measurable stall.
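  • Safe for regular checkpoint files as long as the process outlives the save. Daemon threads are not joined at interpreter exit; Python abandons them, so a save that is still in flight when the process ends can leave a truncated file.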

Caveats

  • If training crashes or exits while a save is still running, the checkpoint file may be partially written. The resume path (_resume_path.exists()) will still attempt to load it. Add error handling in the generated script if this is a concern.
  • Not a substitute for distributed checkpointing on multi-GPU or multi-node jobs; use Hugging Face Accelerate or PyTorch Distributed Checkpoint (DCP) for that.
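
One common way to guard against partially written checkpoints is to write to a temporary file and rename it into place once the write has finished. The sketch below illustrates that pattern under stated assumptions: it is not what KynML generates, the names save_checkpoint_async and pickle_save are hypothetical, and pickle stands in for torch.save so the example runs without PyTorch. It also copies the state before handing it to the thread, on the assumption that training may keep mutating the original.

```python
import copy
import os
import pickle
import tempfile
import threading


def pickle_save(obj, path):
    """Stand-in for torch.save so the sketch runs without PyTorch."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def save_checkpoint_async(state, path, save_fn=pickle_save):
    """Write a checkpoint in a background thread without exposing partial files."""
    snapshot = copy.deepcopy(state)  # training may mutate `state` while we write
    tmp_path = f"{path}.tmp"

    def _write():
        save_fn(snapshot, tmp_path)
        # Rename only after the write finished: readers see the old file or the
        # complete new one (atomic on POSIX within a single filesystem).
        os.replace(tmp_path, path)

    # Non-daemon thread: the interpreter waits for it at normal exit.
    thread = threading.Thread(target=_write)
    thread.start()
    return thread


with tempfile.TemporaryDirectory() as ckpt_dir:
    ckpt_path = os.path.join(ckpt_dir, "ckpt.pt")
    writer = save_checkpoint_async({"epoch": 5, "loss": 0.25}, ckpt_path)
    writer.join()  # a training loop would join before starting the next save
    with open(ckpt_path, "rb") as handle:
        restored = pickle.load(handle)
    leftover_tmp = os.path.exists(ckpt_path + ".tmp")

print(restored, leftover_tmp)
```

Passing save_fn=torch.save would adapt the sketch to real checkpoints. A hard kill can still leave a stale .tmp file behind, but the file at the final path is never half-written.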

Putting it all together

A high-throughput GPU training spec:

dataset LargeData:
    source = csv("data/large_train.csv")
    target = "outcome"
    split = 0.8
    normalize = true
    num_workers = 4
    pin_memory = true
    prefetch = 2

model DeepNet:
    input 128
    dense 512 gelu
    batchnorm
    dropout 0.3
    dense 256 gelu
    batchnorm
    dropout 0.2
    dense 64 relu
    dense 1 linear

train:
    model = DeepNet
    data = LargeData
    loss = mse
    optimizer = adamw(lr=0.001, weight_decay=0.01)
    epochs = 100
    batch = 256
    device = auto
    scheduler = cosine(t_max=100)
    early_stop = early_stop(patience=10)
    checkpoint = checkpoint(every_n=10, path="checkpoints/deepnet.pt", async_save=true)
    precision = bf16
    compile = true

evaluate:
    metrics = [mae, rmse]

export:
    format = torchscript
    path = "models/deepnet.pt"

Expected speedup stack on an A100 vs baseline fp32 / compile=false / num_workers=0:

| Flag | Typical gain |
|---|---|
| precision = bf16 | 1.5–2x |
| compile = true | 1.1–1.3x |
| num_workers = 4 | 1.1–1.4x (I/O bound workloads) |
| pin_memory + prefetch | 1.05–1.15x |

Gains compound but with diminishing returns once training is fully compute-bound.
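
As a rough illustration of how the ranges compound, the sketch below multiplies the low and high ends of each row in the table above. It treats the gains as fully independent, which is an idealising assumption: the result is a ceiling on what the table implies, not a prediction for any particular workload.

```python
import math

# (low, high) multipliers copied from the table above.
gains = {
    "precision = bf16": (1.5, 2.0),
    "compile = true": (1.1, 1.3),
    "num_workers = 4": (1.1, 1.4),
    "pin_memory + prefetch": (1.05, 1.15),
}

naive_low = math.prod(low for low, _ in gains.values())
naive_high = math.prod(high for _, high in gains.values())
print(f"naive combined range: {naive_low:.2f}x to {naive_high:.2f}x")
```

The naive product spans roughly 1.9x to 4.2x. In practice the data-loading flags stop contributing once the GPU is no longer waiting on input, so measured totals tend to sit toward the lower end.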


Reference: all speed flags

| Flag | Block | Default | Effect |
|---|---|---|---|
| precision = fp16 | train | fp32 | AMP with float16 and GradScaler |
| precision = bf16 | train | fp32 | AMP with bfloat16 and GradScaler |
| compile = true | train | false | torch.compile(model) |
| num_workers = N | dataset | 0 | DataLoader worker processes |
| pin_memory = true | dataset | false | Pinned host memory for GPU transfer |
| prefetch = N | dataset | none | DataLoader prefetch_factor |
| async_save = true | checkpoint(...) | false | Threaded torch.save |