Speed Guide
Opt-in flags that reduce training time: AMP precision, torch.compile, DataLoader workers, memory pinning, prefetch, and async checkpointing.
Overview
KynML is conservative by default — fp32, no compilation, num_workers=0. Every flag in this guide is an explicit opt-in. Enable them incrementally; each adds measurable overhead if misused on the wrong hardware.
Mixed-precision training (precision)
What it does
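Setting precision = fp16 or precision = bf16 wraps the forward pass in torch.amp.autocast and uses GradScaler for gradient scaling. Model weights and optimizer state stay in fp32; autocast runs eligible ops (matrix multiplies, convolutions) in the lower-precision dtype, and their backward ops run in that same dtype. Net effect: roughly 1.5–2x throughput on Ampere+ GPUs due to tensor core utilisation.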
train:
    ...
    precision = fp16
or
train:
    ...
    precision = bf16
Generated code
from torch.cuda.amp import GradScaler, autocast
# in train_model():
_use_amp = torch.cuda.is_available()
_scaler = GradScaler() if _use_amp else None
# per batch (dtype shown for precision = fp16; the bf16 setting uses bfloat16):
_amp_ctx = torch.amp.autocast(
    device_type=device.type, dtype=torch.float16
) if _use_amp else contextlib.nullcontext()
with _amp_ctx:
    predictions = model(features)
    loss = criterion(predictions, target)
# scaler path:
if _scaler is not None:
    _scaler.scale(loss).backward()
    _scaler.step(optimizer)
    _scaler.update()
else:
    loss.backward()
    optimizer.step()
When to use
| Scenario | Recommendation |
|---|---|
| NVIDIA GPU (Ampere / Hopper) | fp16 or bf16 — expect 1.5–2x speedup |
| NVIDIA Volta (V100) | fp16 — tensor cores present but bf16 not supported |
| AMD GPU (ROCm) | fp16 — check ROCm version; bf16 support varies |
| Apple Silicon (MPS) | Avoid — AMP is gated on cuda.is_available(), so it silently falls back to fp32. No error, no speedup. |
| CPU-only | Avoid — _use_amp = False, code runs fp32. Overhead: near zero, but no benefit. |
bf16 is more numerically stable than fp16 because it keeps fp32's 8-bit exponent range, so small gradients rarely underflow (no GradScaler is needed in principle, though KynML still uses one for safety). Prefer bf16 on Ampere+. Use fp16 if your GPU predates Ampere or if you have existing fp16 infrastructure.
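The range difference between the two formats can be illustrated without a GPU. The sketch below is not KynML output: it uses only the standard library, the gradient value 1e-8 is an arbitrary illustration, and bf16 is approximated by truncating the float32 encoding (real hardware rounds to nearest). The scale 65536 is GradScaler's default initial scale.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Approximate bfloat16 by keeping the top 16 bits of the float32 encoding.

    Truncation is a simplification; real conversions round to nearest.
    """
    raw = struct.pack("<f", x)  # little-endian: the last two bytes are the top bits
    return struct.unpack("<f", b"\x00\x00" + raw[2:])[0]


tiny_grad = 1e-8        # illustrative gradient magnitude (assumption)
loss_scale = 65536.0    # 2**16, GradScaler's default initial scale

fp16_plain = to_fp16(tiny_grad)                             # underflows to 0.0
fp16_scaled = to_fp16(tiny_grad * loss_scale) / loss_scale  # scale, cast, unscale
bf16_plain = to_bf16(tiny_grad)                             # survives without scaling

print(f"fp16 unscaled: {fp16_plain}")
print(f"fp16 scaled:   {fp16_scaled:.3e}")
print(f"bf16 unscaled: {bf16_plain:.3e}")
```

The unscaled fp16 value is lost entirely, while the scaled fp16 value and the unscaled bf16 value both stay close to 1e-8. This is the reason fp16 depends on loss scaling and bf16 usually does not.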
Default
precision = fp32 — no AMP, no imports, standard loss.backward() / optimizer.step().
torch.compile (compile = true)
What it does
Calls torch.compile(model) after instantiation. PyTorch 2.0+ captures the model graph and compiles it, by default with the TorchInductor backend. Compilation happens lazily on the first forward passes, so the first epoch is slow; subsequent epochs are faster. Expect a 10–30% speedup on GPU; results on CPU vary.
train:
    ...
    compile = true
Generated code
# in main(), with compile = true:
model = HousePriceModel().to(device)
model = torch.compile(model)
# with compile = false (default), the torch.compile line is replaced by a comment:
# compile flag is False; skipping
When to use
| Scenario | Recommendation |
|---|---|
| GPU training, many epochs | Yes — amortises compilation cost after epoch 1 |
| GPU training, few epochs (< 5) | Likely not worth it — compilation overhead dominates |
| CPU training | Possible but modest gains; adds ~30–60s startup time |
| Apple MPS | Not supported in most PyTorch builds |
| Debugging | Disable — compiled graphs suppress readable tracebacks |
torch.compile and precision = fp16/bf16 stack cleanly — use both for maximum GPU throughput.
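The amortisation argument in the table above can be made concrete with a small break-even estimate. This is a back-of-the-envelope sketch, not part of KynML: the function name and the sample numbers are hypothetical, and it assumes a fixed one-off compilation cost and a constant per-epoch speedup.

```python
import math


def break_even_epochs(compile_overhead_s: float, epoch_s: float, speedup: float) -> int:
    """Smallest whole number of epochs after which compiling has paid for itself.

    Solves: overhead + n * epoch_s / speedup <= n * epoch_s   for n.
    """
    if speedup <= 1.0:
        raise ValueError("compilation never pays off without a speedup above 1")
    saved_per_epoch = epoch_s * (1.0 - 1.0 / speedup)
    return math.ceil(compile_overhead_s / saved_per_epoch)


# Hypothetical run: 45 s of compilation, 60 s per uncompiled epoch, 1.25x speedup.
print(break_even_epochs(compile_overhead_s=45, epoch_s=60, speedup=1.25))
```

With short epochs or a small speedup the break-even point moves out quickly, which is why runs of only a few epochs rarely benefit.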
DataLoader workers (num_workers, pin_memory, prefetch)
These are dataset-block fields that control I/O parallelism.
num_workers
dataset HouseData:
    source = csv("data/housing.csv")
    target = "price"
    num_workers = 4
Sets DataLoader(num_workers=4). Worker processes load and batch data in the background while the GPU trains. Default is 0 (main process only).
Generated code:
dataloader_extra = "num_workers=4, pin_memory=False"
DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4, pin_memory=False)
Guidelines:
- Start with num_workers = 4 on a machine with 8+ CPU cores and a GPU.
- On CPU-only machines, num_workers > 0 adds IPC overhead with no GPU work to overlap; keep it at 0.
- The worker start method matters. On Linux, DataLoader workers have traditionally been forked by default, which can conflict with some libraries; on macOS with Python 3.8+ the default is already spawn. If you see worker crashes, pass multiprocessing_context="spawn" explicitly in the generated script. Setting PYTHONWARNINGS=ignore only silences warnings and does not prevent crashes.
- Rule of thumb: num_workers = num_CPU_cores / 2, capped at 8.
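The guidelines above can be folded into a small helper. This is an illustrative sketch only: suggest_num_workers is a hypothetical name, not a KynML function, and it simply encodes the half-the-cores rule with a cap of 8 and the keep-at-0 advice for CPU-only machines.

```python
import os
from typing import Optional


def suggest_num_workers(cpu_cores: Optional[int] = None,
                        has_gpu: bool = True,
                        cap: int = 8) -> int:
    """Apply the rule of thumb: half the CPU cores, capped, and 0 without a GPU."""
    if not has_gpu:
        return 0  # no GPU work to overlap with, so workers only add IPC overhead
    if cpu_cores is None:
        cpu_cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return min(cap, cpu_cores // 2)


print(suggest_num_workers(cpu_cores=8))                  # 8-core GPU machine
print(suggest_num_workers(cpu_cores=32))                 # many cores: the cap applies
print(suggest_num_workers(cpu_cores=16, has_gpu=False))  # CPU-only machine
```

Treat the result as a starting value for the num_workers field and tune it by measuring epoch time.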
pin_memory
dataset HouseData:
    source = csv("data/housing.csv")
    target = "price"
    num_workers = 4
    pin_memory = true
Sets DataLoader(pin_memory=True). Batches are copied into page-locked (pinned) host memory, which enables faster, asynchronous host-to-device transfers via DMA.
- Only effective with a CUDA GPU. Has no effect and may add overhead on CPU-only or MPS.
- Works best with num_workers >= 1: a background thread in the main process then pins batches while the training loop runs. With num_workers = 0, pinning happens inline in the training loop.
prefetch
dataset HouseData:
    source = csv("data/housing.csv")
    target = "price"
    num_workers = 4
    pin_memory = true
    prefetch = 2
Sets DataLoader(prefetch_factor=2). Each worker pre-loads 2 batches ahead. Reduces GPU idle time between batches.
Generated code:
DataLoader(train_dataset, batch_size=32, shuffle=True,
           num_workers=4, pin_memory=True, prefetch_factor=2)
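Requires num_workers >= 1; PyTorch raises a ValueError if prefetch_factor is set while num_workers is 0. Values of 2–4 are typical; larger values increase memory usage, because num_workers × prefetch_factor batches are held in memory at once. PyTorch's own default is 2 when workers are enabled.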
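To judge how much host memory a prefetch setting costs, a rough estimate is usually enough. The helper below is a hypothetical sketch, not a KynML or PyTorch function; it assumes dense float32 features (4 bytes per value) and counts num_workers × prefetch_factor batches in flight.

```python
def prefetched_bytes(num_workers: int,
                     prefetch_factor: int,
                     batch_size: int,
                     values_per_sample: int,
                     bytes_per_value: int = 4) -> int:
    """Rough host memory held by prefetched batches (assumes dense numeric data)."""
    batches_in_flight = num_workers * prefetch_factor
    return batches_in_flight * batch_size * values_per_sample * bytes_per_value


# Tabular data: 4 workers, prefetch 2, batch 256, 128 float32 features per row.
tabular = prefetched_bytes(4, 2, 256, 128)
print(f"tabular: {tabular / 2**20:.2f} MiB")

# Same settings with 224x224 RGB images stored as float32.
images = prefetched_bytes(4, 2, 256, 224 * 224 * 3)
print(f"images:  {images / 2**30:.2f} GiB")
```

For small tabular rows the cost is negligible, so prefetch can be raised freely; for image-sized samples the same settings already hold over a gigabyte.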
Recommended GPU config
dataset TrainData:
    source = csv("data/train.csv")
    target = "label"
    num_workers = 4
    pin_memory = true
    prefetch = 2
Recommended CPU config
dataset TrainData:
    source = csv("data/train.csv")
    target = "label"
    num_workers = 0 # or omit; 0 is the default
    pin_memory = false
Async checkpoint saving (async_save = true)
What it does
By default, checkpoints are saved synchronously: torch.save(state, path) blocks the training loop. With async_save = true, the save is offloaded to a daemon thread so training continues immediately.
train:
    ...
    checkpoint = checkpoint(every_n=5, path="checkpoints/ckpt.pt", async_save=true)
Generated code
# async_save = false (default):
torch.save(_state, _ckpt_path)
# async_save = true:
_t = threading.Thread(target=torch.save, args=(_state, _ckpt_path), daemon=True)
_t.start()
The threading import is always present in the generated script header.
When to use
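- On large models (100M+ parameters) where torch.save takes > 1s per checkpoint.
- When checkpointing every epoch and training on fast GPUs where the save becomes a measurable stall.
- Safe for regular checkpoint files during a normal run. Note that Python does not join daemon threads at exit: a save still in progress when the process ends is cut off, so the last checkpoint may be incomplete unless the script calls _t.join() before exiting.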
Caveats
- If training crashes immediately after _t.start(), the checkpoint file may be partially written. The resume path (_resume_path.exists()) will still attempt to load it. Add error handling in the generated script if this is a concern.
- Not a substitute for persistent storage on distributed jobs; use Hugging Face Accelerate or PyTorch DCP for that.
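One way to avoid a partially written checkpoint is to write to a temporary file and rename it over the target only after the write succeeds. The sketch below is a suggestion, not what KynML generates: async_save, _atomic_save, and pickle_save are hypothetical names, and pickle_save stands in for torch.save so the example runs without PyTorch. It also snapshots the state before handing it to the thread, so later training steps cannot change it mid-write.

```python
import copy
import os
import pickle
import tempfile
import threading


def _atomic_save(state, path, save_fn):
    """Write to a temp file in the target directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        save_fn(state, tmp_path)
        os.replace(tmp_path, path)  # readers see the old file or the new one, never half
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def async_save(state, path, save_fn):
    """Start a background save and return the thread so the caller can join it."""
    snapshot = copy.deepcopy(state)  # costs memory, but isolates the save from training
    thread = threading.Thread(target=_atomic_save, args=(snapshot, path, save_fn))
    thread.start()  # non-daemon: the interpreter waits for it at exit
    return thread


def pickle_save(state, path):
    """Stand-in for torch.save(state, path) so the sketch runs without PyTorch."""
    with open(path, "wb") as fh:
        pickle.dump(state, fh)
```

In a generated script, torch.save could be passed as save_fn since it accepts the same (object, path) arguments. Joining the returned thread before starting the next save keeps two writes from overlapping.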
Putting it all together
A high-throughput GPU training spec:
dataset LargeData:
    source = csv("data/large_train.csv")
    target = "outcome"
    split = 0.8
    normalize = true
    num_workers = 4
    pin_memory = true
    prefetch = 2
model DeepNet:
    input 128
    dense 512 gelu
    batchnorm
    dropout 0.3
    dense 256 gelu
    batchnorm
    dropout 0.2
    dense 64 relu
    dense 1 linear
train:
    model = DeepNet
    data = LargeData
    loss = mse
    optimizer = adamw(lr=0.001, weight_decay=0.01)
    epochs = 100
    batch = 256
    device = auto
    scheduler = cosine(t_max=100)
    early_stop = early_stop(patience=10)
    checkpoint = checkpoint(every_n=10, path="checkpoints/deepnet.pt", async_save=true)
    precision = bf16
    compile = true
evaluate:
    metrics = [mae, rmse]
export:
    format = torchscript
    path = "models/deepnet.pt"
Expected speedup stack on an A100 vs baseline fp32 / compile=false / num_workers=0:
| Flag | Typical gain |
|---|---|
| precision = bf16 | 1.5–2x |
| compile = true | 1.1–1.3x |
| num_workers = 4 | 1.1–1.4x (I/O bound workloads) |
| pin_memory + prefetch | 1.05–1.15x |
Gains compound but with diminishing returns once training is fully compute-bound.
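For a sense of scale, the ranges in the table can be multiplied together. The arithmetic below treats the four gains as fully independent, which is an idealised assumption: it gives a ceiling, not a prediction, and real runs land lower once the workload becomes compute-bound.

```python
import math

# (low, high) typical gains copied from the table above.
gains = {
    "precision = bf16": (1.5, 2.0),
    "compile = true": (1.1, 1.3),
    "num_workers = 4": (1.1, 1.4),
    "pin_memory + prefetch": (1.05, 1.15),
}

# Idealised compounding: multiply the per-flag gains.
combined_low = math.prod(low for low, _ in gains.values())
combined_high = math.prod(high for _, high in gains.values())

print(f"idealised combined speedup: {combined_low:.2f}x to {combined_high:.2f}x")
```

Under these assumptions the stack lands at roughly 1.9x to 4.2x over the baseline.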
Reference: all speed flags
| Flag | Block | Default | Effect |
|---|---|---|---|
| precision = fp16 | train | fp32 | AMP with float16 and GradScaler |
| precision = bf16 | train | fp32 | AMP with bfloat16 and GradScaler |
| compile = true | train | false | torch.compile(model) |
| num_workers = N | dataset | 0 | DataLoader worker processes |
| pin_memory = true | dataset | false | Pinned host memory for GPU transfer |
| prefetch = N | dataset | none | DataLoader prefetch_factor |
| async_save = true | checkpoint(...) | false | Threaded torch.save |