Speed Guide

Opt-in flags that reduce training time: AMP precision, torch.compile, DataLoader workers, memory pinning, prefetch, and async checkpointing.

Overview

KynML is conservative by default: fp32, no compilation, num_workers=0. Every flag in this guide is an explicit opt-in. Enable them one at a time and measure; on the wrong hardware a flag can add overhead instead of removing it.

Mixed-precision training (`precision`)

What it does

precision = fp16 or precision = bf16 wraps the forward pass in torch.amp.autocast and uses GradScaler for gradient scaling. Model weights and optimizer state stay in fp32; autocast only lowers the dtype in which eligible ops (matrix multiplies, convolutions) run. Net effect: roughly 1.5–2x throughput on Ampere+ GPUs due to tensor core utilisation.

train:
    ...
    precision = fp16

train:
    ...
    precision = bf16

Generated code

import contextlib

import torch
from torch.cuda.amp import GradScaler, autocast

# in train_model():
_use_amp = torch.cuda.is_available()  # AMP is enabled only when CUDA is present
_scaler = GradScaler() if _use_amp else None

# per batch (shown for precision = fp16):
_amp_ctx = torch.amp.autocast(
    device_type=device.type, dtype=torch.float16
) if _use_amp else contextlib.nullcontext()
with _amp_ctx:
    predictions = model(features)
    loss = criterion(predictions, target)
# scaler path: scale the loss, step through the scaler, then update the scale factor
if _scaler is not None:
    _scaler.scale(loss).backward()
    _scaler.step(optimizer)
    _scaler.update()
else:
    # fp32 fallback (no CUDA): plain backward and optimizer step
    loss.backward()
    optimizer.step()

When to use

Scenario	Recommendation
NVIDIA GPU (Ampere / Hopper)	`fp16` or `bf16` — expect 1.5–2x speedup
NVIDIA Volta (V100)	`fp16` — tensor cores present but bf16 not supported
AMD GPU (ROCm)	`fp16` — check ROCm version; bf16 support varies
Apple Silicon (MPS)	Avoid — AMP is gated on `cuda.is_available()`, so it silently falls back to fp32. No error, no speedup.
CPU-only	Avoid — `_use_amp = False`, code runs fp32. Overhead: near zero, but no benefit.

bf16 is more numerically stable than fp16: it has the same exponent range as fp32, so gradients rarely underflow and a GradScaler is not needed in principle (KynML still uses one for safety). Prefer bf16 on Ampere+. Use fp16 if your GPU predates Ampere or if you have existing fp16 infrastructure.
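
The recommendations in this section can be condensed into a small selection helper. This is a minimal sketch, not part of KynML: choose_precision and detect_precision are hypothetical names, and the mapping only mirrors the table above. The probe uses torch.cuda.is_available() and torch.cuda.is_bf16_supported(), imported lazily so the pure helper has no dependency.

```python
def choose_precision(has_cuda: bool, bf16_supported: bool) -> str:
    """Map hardware capabilities to a precision value (hypothetical helper)."""
    if not has_cuda:
        # CPU-only and MPS: AMP is gated on CUDA, so fp16/bf16 would run as fp32 anyway.
        return "fp32"
    # Prefer bf16 where the GPU supports it (Ampere+), otherwise fp16 (e.g. V100).
    return "bf16" if bf16_supported else "fp16"


def detect_precision() -> str:
    """Probe the current machine and return the suggested precision value."""
    import torch  # imported lazily; only needed for the hardware probe

    has_cuda = torch.cuda.is_available()
    bf16_supported = has_cuda and torch.cuda.is_bf16_supported()
    return choose_precision(has_cuda, bf16_supported)


print(choose_precision(has_cuda=True, bf16_supported=False))  # fp16
```

Run detect_precision() on the training machine and copy the result into the train block; treat it as a starting point, since bf16 support on ROCm varies by version.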

Default

precision = fp32: no AMP, no AMP-related imports, standard loss.backward() / optimizer.step().

`torch.compile` (`compile = true`)

What it does

Calls torch.compile(model) after instantiation. PyTorch 2.0+ traces and compiles the model graph, typically via TorchInductor. The first epoch is slow because compilation happens during the first forward and backward passes; subsequent epochs are faster. Expect a 10–30% speedup on GPU; gains on CPU vary.

train:
    ...
    compile = true

Generated code

# in main(), with compile = true:
model = HousePriceModel().to(device)
model = torch.compile(model)

# with compile = false, the torch.compile line is replaced by a comment:
# compile flag is False; skipping

When to use

Scenario	Recommendation
GPU training, many epochs	Yes — amortises compilation cost after epoch 1
GPU training, few epochs (< 5)	Likely not worth it — compilation overhead dominates
CPU training	Possible but modest gains; adds ~30–60s startup time
Apple MPS	Not supported in most PyTorch builds
Debugging	Disable — compiled graphs suppress readable tracebacks

torch.compile and precision = fp16/bf16 stack cleanly — use both for maximum GPU throughput.
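
Whether a run is long enough to amortise compilation can be estimated with simple arithmetic. The sketch below is illustrative only: break_even_epochs is a hypothetical helper, and the overhead, epoch time, and speedup passed to it are made-up numbers that you would replace with your own measurements.

```python
import math


def break_even_epochs(compile_overhead_s: float, epoch_time_s: float, speedup: float) -> int:
    """Epochs needed before the time saved by compilation exceeds its one-off cost."""
    if speedup <= 1.0:
        raise ValueError("speedup must be > 1.0 for compilation to ever pay off")
    # A compiled epoch takes epoch_time_s / speedup, so each epoch saves this much:
    saved_per_epoch_s = epoch_time_s * (1.0 - 1.0 / speedup)
    return math.ceil(compile_overhead_s / saved_per_epoch_s)


# Hypothetical numbers: 50 s compile overhead, 20 s per uncompiled epoch, 1.25x speedup.
print(break_even_epochs(50.0, 20.0, 1.25))  # 13
```

With these numbers a 5-epoch run never recovers the compile cost, which matches the "few epochs" row in the table.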

DataLoader workers (`num_workers`, `pin_memory`, `prefetch`)

These are dataset-block fields that control I/O parallelism.

`num_workers`

dataset HouseData:
    source = csv("data/housing.csv")
    target = "price"
    num_workers = 4

Sets DataLoader(num_workers=4). Worker processes load and collate upcoming batches while the GPU trains. Default is 0 (main process only).

Generated code:

dataloader_extra = "num_workers=4, pin_memory=False"
DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=4, pin_memory=False)

Guidelines:

Start with num_workers = 4 on a machine with 8+ CPU cores and a GPU.
On CPU-only machines, num_workers > 0 adds IPC overhead with no GPU to overlap — keep at 0.
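On Linux, DataLoader workers are typically started with fork, which can conflict with libraries that hold threads or GPU state at fork time. On macOS (Python 3.8+) and Windows the default start method is spawn, which requires the generated script's entry point to sit under if __name__ == "__main__":. If you see worker crashes or hangs, pass multiprocessing_context="spawn" to the DataLoader in the generated script. Setting PYTHONWARNINGS=ignore only hides warnings and does not fix the underlying problem.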
Rule of thumb: num_workers = num_CPU_cores / 2, capped at 8.
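
The guidelines above reduce to a one-line calculation. A minimal sketch, assuming the rule of thumb as stated (half the CPU cores, capped at 8, and 0 without a GPU); suggest_num_workers is a hypothetical helper, not a KynML function.

```python
import os


def suggest_num_workers(cpu_cores: int, has_gpu: bool, cap: int = 8) -> int:
    """Apply the rule of thumb: cores / 2, capped, and 0 on CPU-only machines."""
    if not has_gpu:
        # No GPU work to overlap with, so workers only add IPC overhead.
        return 0
    return max(0, min(cpu_cores // 2, cap))


cores = os.cpu_count() or 1  # os.cpu_count() can return None
print(suggest_num_workers(cores, has_gpu=True))
```

Treat the result as a starting value and adjust after watching GPU utilisation during a real run.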

`pin_memory`

dataset HouseData:
    source = csv("data/housing.csv")
    target = "price"
    num_workers = 4
    pin_memory = true

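Sets DataLoader(pin_memory=True). Each batch is copied into page-locked (pinned) host memory, which enables faster, asynchronous host-to-device transfers via DMA. The copy only overlaps with compute if the training loop moves tensors with .to(device, non_blocking=True).

Only effective with a CUDA GPU. Has no effect and may add overhead on CPU-only or MPS.
Works with any num_workers value, but the benefit is largest with num_workers >= 1, where batches are pinned in a background thread of the main process while the GPU trains.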

`prefetch`

dataset HouseData:
    source = csv("data/housing.csv")
    target = "price"
    num_workers = 4
    pin_memory = true
    prefetch = 2

Sets DataLoader(prefetch_factor=2). Each worker keeps 2 batches loaded ahead of the training loop, so 4 workers buffer up to 8 batches in total. This reduces GPU idle time between batches.

Generated code:

DataLoader(train_dataset, batch_size=32, shuffle=True,
           num_workers=4, pin_memory=True, prefetch_factor=2)

Requires num_workers >= 1. Values of 2–4 are typical; larger values increase memory usage.
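
To see how prefetch affects memory, multiply the number of batches in flight (num_workers × prefetch) by the size of one batch. The sketch below is a rough estimate under stated assumptions: features only, fixed-width values (float32 by default), no Python or IPC overhead. prefetch_buffer_bytes is a hypothetical helper.

```python
def prefetch_buffer_bytes(
    num_workers: int,
    prefetch_factor: int,
    batch_size: int,
    features_per_row: int,
    bytes_per_value: int = 4,  # float32
) -> int:
    """Rough upper bound on host memory held by prefetched feature batches."""
    if num_workers < 1:
        # With num_workers = 0 nothing is prefetched; batches are loaded on demand.
        return 0
    batches_in_flight = num_workers * prefetch_factor
    return batches_in_flight * batch_size * features_per_row * bytes_per_value


# 4 workers x 2 prefetched batches, batch of 256 rows with 128 float32 features each.
print(prefetch_buffer_bytes(4, 2, 256, 128))  # 1048576 bytes (1 MiB)
```

For small tabular data the buffer is negligible; it matters when individual samples are large, for example images or long sequences.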

Recommended GPU config

dataset TrainData:
    source = csv("data/train.csv")
    target = "label"
    num_workers = 4
    pin_memory = true
    prefetch = 2

Recommended CPU config

dataset TrainData:
    source = csv("data/train.csv")
    target = "label"
    num_workers = 0   # or omit — 0 is default
    pin_memory = false
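
The constraints between these three fields can be checked before a DataLoader is built. This is a sketch of one possible validation, not the code KynML generates; dataloader_kwargs is a hypothetical helper whose output could be passed as DataLoader(train_dataset, batch_size=..., shuffle=True, **kwargs).

```python
def dataloader_kwargs(num_workers: int = 0, pin_memory: bool = False, prefetch=None) -> dict:
    """Build DataLoader keyword arguments from the dataset-block fields."""
    if num_workers < 0:
        raise ValueError("num_workers must be >= 0")
    kwargs = {"num_workers": num_workers, "pin_memory": pin_memory}
    if prefetch is not None:
        # DataLoader rejects prefetch_factor when there are no worker processes.
        if num_workers < 1:
            raise ValueError("prefetch requires num_workers >= 1")
        kwargs["prefetch_factor"] = prefetch
    return kwargs


print(dataloader_kwargs(num_workers=4, pin_memory=True, prefetch=2))  # GPU config
print(dataloader_kwargs())                                            # CPU config
```

Leaving prefetch_factor out entirely when prefetch is unset keeps the CPU configuration valid with num_workers = 0.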

Async checkpoint saving (`async_save = true`)

What it does

By default, checkpoints are saved synchronously: torch.save(state, path) blocks the training loop until the file is written. With async_save = true, the save is offloaded to a daemon thread so training continues immediately.

train:
    ...
    checkpoint = checkpoint(every_n=5, path="checkpoints/ckpt.pt", async_save=true)

Generated code

# async_save = false (default):
torch.save(_state, _ckpt_path)

# async_save = true:
_t = threading.Thread(target=torch.save, args=(_state, _ckpt_path), daemon=True)
_t.start()

The threading import is always present in the generated script header.

When to use

On large models (100M+ parameters) where torch.save takes > 1s per checkpoint.
When checkpointing every epoch and training on fast GPUs where the save becomes a measurable stall.
Safe for regular checkpoint files while the run is alive. Daemon threads are not joined at interpreter exit, so a save still in flight when the process ends is cut short.

Caveats

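If training crashes immediately after _t.start(), the checkpoint file may be partially written. The resume path checks only _resume_path.exists(), so it will still attempt to load the truncated file. Add error handling in the generated script if this is a concern.
If _state holds references to live tensors (model.state_dict() returns references, not copies), the optimizer can update them while the thread is still serialising. Snapshot the state first if checkpoints must be consistent.
Not a substitute for distributed checkpointing on multi-process jobs; use Hugging Face Accelerate or PyTorch Distributed Checkpoint (torch.distributed.checkpoint) for that.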
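
One common way to avoid partially written checkpoints is to write to a temporary file and rename it into place. The sketch below shows that pattern under stated assumptions: it is not what KynML generates, save_atomically and start_async_save are hypothetical names, and pickle stands in for torch.save so the example runs without PyTorch.

```python
import os
import pickle
import threading
from pathlib import Path


def _pickle_save(state, path):
    # Stand-in for torch.save so the sketch runs without PyTorch installed.
    with open(path, "wb") as f:
        pickle.dump(state, f)


def save_atomically(state, path, save_fn=_pickle_save):
    """Write to a temporary file, then rename it over the real checkpoint."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = path.with_name(path.name + ".tmp")
    save_fn(state, tmp_path)
    # os.replace is atomic on the same filesystem: readers see the old file or the new one.
    os.replace(tmp_path, path)


def start_async_save(state, path, save_fn=_pickle_save):
    """Start the save in a non-daemon thread so a normal exit waits for it to finish."""
    thread = threading.Thread(target=save_atomically, args=(state, path, save_fn))
    thread.start()
    return thread
```

In a generated script this could be called as start_async_save(_state, _ckpt_path, save_fn=torch.save), keeping the returned thread and calling .join() on it before the next save or at the end of training. A hard kill can still interrupt the write, but the previous checkpoint stays intact because only the temporary file is affected.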

Putting it all together

A high-throughput GPU training spec:

dataset LargeData:
    source = csv("data/large_train.csv")
    target = "outcome"
    split = 0.8
    normalize = true
    num_workers = 4
    pin_memory = true
    prefetch = 2

model DeepNet:
    input 128
    dense 512 gelu
    batchnorm
    dropout 0.3
    dense 256 gelu
    batchnorm
    dropout 0.2
    dense 64 relu
    dense 1 linear

train:
    model = DeepNet
    data = LargeData
    loss = mse
    optimizer = adamw(lr=0.001, weight_decay=0.01)
    epochs = 100
    batch = 256
    device = auto
    scheduler = cosine(t_max=100)
    early_stop = early_stop(patience=10)
    checkpoint = checkpoint(every_n=10, path="checkpoints/deepnet.pt", async_save=true)
    precision = bf16
    compile = true

evaluate:
    metrics = [mae, rmse]

export:
    format = torchscript
    path = "models/deepnet.pt"

Expected speedup stack on an A100 vs baseline fp32 / compile=false / num_workers=0:

Flag	Typical gain
`precision = bf16`	1.5–2x
`compile = true`	1.1–1.3x
`num_workers = 4`	1.1–1.4x (I/O bound workloads)
`pin_memory + prefetch`	1.05–1.15x

Gains compound multiplicatively at best, with diminishing returns: the data-loading flags stop helping once training is fully compute-bound.
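
As a worked example of the arithmetic, the ranges in the table can be multiplied to get an optimistic bound on the combined speedup. This assumes every gain applies independently, which real workloads rarely achieve, so read the result as a ceiling rather than a prediction.

```python
import math

# (low, high) multipliers copied from the table above.
gains = {
    "precision = bf16": (1.5, 2.0),
    "compile = true": (1.1, 1.3),
    "num_workers = 4": (1.1, 1.4),
    "pin_memory + prefetch": (1.05, 1.15),
}

low = math.prod(lo for lo, _ in gains.values())
high = math.prod(hi for _, hi in gains.values())
print(f"optimistic combined range: {low:.2f}x to {high:.2f}x")  # 1.91x to 4.19x
```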

Reference: all speed flags

Flag	Block	Default	Effect
`precision = fp16`	`train`	`fp32`	AMP with float16 and GradScaler
`precision = bf16`	`train`	`fp32`	AMP with bfloat16 and GradScaler
`compile = true`	`train`	`false`	`torch.compile(model)`
`num_workers = N`	`dataset`	`0`	DataLoader worker processes
`pin_memory = true`	`dataset`	`false`	Pinned host memory for GPU transfer
`prefetch = N`	`dataset`	none	DataLoader `prefetch_factor`
`async_save = true`	`checkpoint(...)`	`false`	Threaded `torch.save`
