
Compiler Internals

Per-module responsibilities and concrete extension points for adding new layers, optimizers, losses, activations, backends, and export formats.

Pipeline Stages

source → parse → AST → compose → validate → lower(IR) → infer(shapes) → Backend.emit

Each stage is handled by a distinct module with a single responsibility. The canonical entry point that chains them all is compile_to_ir() in kynml/pipeline.py.

Module Responsibilities

kynml/parser.py

Handwritten, indentation-aware LL(1) parser. No parser generator dependency.

  • Top-level blocks: import "path", params:, sweep:, dataset <Name>:, model <Name>:, train:, evaluate:, export:
  • Block bodies must be indented exactly 4 spaces; top-level lines start at column 0.
  • Comments: lines whose first non-whitespace character is # are stripped before parsing.
  • Values: strings ("..."), integers, floats, booleans (true/false), lists ([a, b]), function calls (name(k=v, ...)), bare identifiers, and $name ParamRef tokens.
  • Function call arguments support nesting: scheduler=onecycle(max_lr=0.01).
  • Sweep axes require list values — lr = 0.001 inside a sweep: block is a parse error.
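The line-level rules above (comment stripping, column-0 top-level lines, 4-space bodies) can be illustrated with a small pre-scan. This is a sketch under stated assumptions, not the parser's actual code: `scan_lines` is a hypothetical helper, `ValueError` stands in for KynMLParseError, and the sample spec is illustrative only.

```python
def scan_lines(text: str, filename: str = "<text>") -> list:
    """Return (line_number, depth, content) for every significant line.

    Hypothetical helper: depth 0 is a top-level line, depth 1 a block-body line.
    """
    significant = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        content = raw.strip()
        # Blank lines and comment lines (first non-whitespace char is '#') are dropped.
        if not content or content.startswith("#"):
            continue
        # Everything before the first non-whitespace character.
        leading = raw[: len(raw) - len(raw.lstrip())]
        if leading not in ("", "    "):
            raise ValueError(
                f"{filename}:{lineno}: indentation must be 0 or exactly 4 spaces"
            )
        significant.append((lineno, 1 if leading else 0, content))
    return significant


spec = "# demo spec\nmodel Net:\n    dense 8 relu\n"
for entry in scan_lines(spec):
    print(entry)
```

The error string follows the file:line: message shape described for parse errors; line numbers refer to the original text, so stripped comments do not shift them.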

Key public API:

from kynml.parser import parse_file, parse_text

program = parse_file("model.kyn")       # reads file from disk
program = parse_text(spec_str)          # parses in-memory string

Errors raise KynMLParseError with file:line: message format.

kynml/ast_nodes.py

All AST nodes are frozen dataclasses — safe to hash, share, and inspect.

| Node | Fields |
| --- | --- |
| Program | datasets, models, train, evaluate, export, params, sweep, imports |
| ParamsBlock | values: dict[str, Any] |
| SweepBlock | axes: dict[str, list] |
| ParamRef | name: str (sentinel for $name tokens before substitution) |
| DatasetBlock | name, source, target, split, normalize, shuffle, num_workers, pin_memory, prefetch |
| ModelBlock | name, layers |
| TrainBlock | model, data, loss, optimizer, epochs, batch, device, scheduler, early_stop, checkpoint, precision, compile, seed, deterministic |
| EvaluateBlock | metrics |
| ExportBlock | format, path, input_shape, opset |
| InputLayer | size |
| DenseLayer | units, activation |
| DropoutLayer | rate |
| BatchNorm1dLayer | features (optional; inferred by the IR pass) |
| FunctionCall | name, args, kwargs |

FunctionCall represents structured function values: adam(lr=0.001) parses to FunctionCall(name="adam", args=[], kwargs={"lr": 0.001}).

A ParamRef in an AST field (e.g. DenseLayer(units=ParamRef("hidden"))) is a sentinel that the composition pass replaces with a concrete value before validation. Programs that pass through compile_to_ir() always have ParamRef sentinels fully resolved before validate_program() runs.

kynml/compose.py

Composition pre-passes. Applied before semantic validation.

Pass 1 — resolve_imports(program, source_path)

Merges dataset and model blocks from imported .kyn files into the host program. Import resolution is recursive with cycle detection. Only dataset and model blocks are imported; train, evaluate, export, params, and sweep belong exclusively to the top-level program. Duplicate names between an imported file and the host raise KynMLSemanticError.

from kynml.compose import resolve_imports
program = resolve_imports(program, source_path="main.kyn")
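
Recursive import resolution with cycle detection can be sketched over an in-memory file table. Everything here is an assumption made for illustration: `resolve_blocks` and the `files` layout are hypothetical, `ValueError` stands in for KynMLSemanticError, and treating a diamond import (the same block reached twice) as harmless is a choice of this sketch, not documented behaviour.

```python
def resolve_blocks(path, files, _stack=()):
    """Return {block_name: block} for `path`, following imports depth-first.

    `files` maps a path to {"imports": [paths], "blocks": {name: block}}.
    """
    if path in _stack:
        chain = " -> ".join(_stack + (path,))
        raise ValueError(f"import cycle: {chain}")
    entry = files[path]
    merged = {}

    def merge(name, block, origin):
        # A different block under an already-used name is a duplicate.
        if name in merged and merged[name] is not block:
            raise ValueError(f"duplicate block name {name!r} (from {origin})")
        merged[name] = block

    for imported in entry["imports"]:
        for name, block in resolve_blocks(imported, files, _stack + (path,)).items():
            merge(name, block, imported)
    for name, block in entry["blocks"].items():
        merge(name, block, path)
    return merged


files = {
    "main.kyn": {"imports": ["shared.kyn"], "blocks": {"Net": {"kind": "model"}}},
    "shared.kyn": {"imports": [], "blocks": {"Iris": {"kind": "dataset"}}},
}
print(sorted(resolve_blocks("main.kyn", files)))
```

The `_stack` tuple holds the chain of files currently being resolved, so a file that reappears in its own chain is reported with the full cycle path.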

Pass 2 — substitute_params(program, overrides=None)

Replaces all ParamRef sentinels with concrete values. Resolution order: overrides dict (from CLI --param) takes priority over params block defaults. A ParamRef name that appears in neither source raises KynMLSemanticError.

from kynml.compose import substitute_params
resolved = substitute_params(program, overrides={"hidden": 128})
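
The resolution order can be sketched on a single frozen node. This is a simplified illustration, not the library's implementation: the two dataclasses are local stand-ins, `substitute_node` is a hypothetical name, `KeyError` stands in for KynMLSemanticError, and only top-level fields are visited (the real pass presumably walks the whole program).

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Optional


@dataclass(frozen=True)
class ParamRef:          # local stand-in for the AST sentinel
    name: str


@dataclass(frozen=True)
class DenseLayer:        # local stand-in with the documented fields
    units: Any
    activation: str = "linear"


def substitute_node(node, defaults: dict, overrides: Optional[dict] = None):
    """Return a copy of `node` with every ParamRef field replaced by a value."""
    values = {**defaults, **(overrides or {})}   # overrides win over defaults
    changes = {}
    for f in fields(node):
        value = getattr(node, f.name)
        if isinstance(value, ParamRef):
            if value.name not in values:
                raise KeyError(f"undefined param ${value.name}")
            changes[f.name] = values[value.name]
    # Frozen nodes are never mutated; replace() builds a new instance.
    return replace(node, **changes) if changes else node


layer = DenseLayer(units=ParamRef("hidden"), activation="relu")
print(substitute_node(layer, {"hidden": 64}))
print(substitute_node(layer, {"hidden": 64}, overrides={"hidden": 128}))
```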

Combined — apply_composition(program, source_path, overrides)

The canonical composition entry point; it runs both passes in order. Programs that use no import, params, or sweep features pass through unchanged.

kynml/semantic.py

Pure validation — no code emission, no I/O. Raises KynMLSemanticError on violation.

Validates:

  • At least one dataset, one model, and a train block exist.
  • dataset.source must be csv("path") with exactly one positional argument.
  • Dataset split in (0, 1), num_workers >= 0.
  • Model starts with exactly one InputLayer, has at least one DenseLayer, no DenseLayer before InputLayer.
  • Layer units/size positive, dropout rate in [0, 1).
  • Activation in SUPPORTED_ACTIVATIONS, loss in SUPPORTED_LOSSES, optimizer in SUPPORTED_OPTIMIZERS.
  • train.model and train.data reference names defined in the same program.
  • epochs > 0, batch > 0.
  • device in {"auto", "cpu", "cuda"}, precision in {"fp32", "fp16", "bf16"}.
  • Scheduler in {"step", "cosine", "onecycle"}.
  • early_stop(patience=N): patience > 0 if given.
  • checkpoint(every_n=N): every_n > 0 if given.
  • Metrics in SUPPORTED_METRICS.
  • Export format in {"torch", "onnx", "torchscript"}; ONNX requires input_shape.

Typo suggestions use difflib.get_close_matches on unknown names.
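
A minimal sketch of how such a suggestion could be built with difflib.get_close_matches. The helper name `unknown_name_message` and the message wording are assumptions; only the difflib call and the activation names come from this page.

```python
import difflib

SUPPORTED_ACTIVATIONS = {
    "relu", "sigmoid", "tanh", "linear",
    "leaky_relu", "gelu", "softmax", "log_softmax",
}


def unknown_name_message(kind: str, name: str, supported) -> str:
    """Build an error message, adding a hint when a close match exists."""
    # sorted() makes the candidate order deterministic; n=1 keeps the best match.
    matches = difflib.get_close_matches(name, sorted(supported), n=1)
    hint = f" Did you mean '{matches[0]}'?" if matches else ""
    return f"unknown {kind} '{name}'.{hint}"


print(unknown_name_message("activation", "sigmod", SUPPORTED_ACTIVATIONS))
```

get_close_matches uses a default similarity cutoff of 0.6, so names with no plausible neighbour produce no hint at all.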

Constants you can import for extension work:

from kynml.semantic import (
    SUPPORTED_ACTIVATIONS,    # {"relu", "sigmoid", "tanh", "linear", "leaky_relu", "gelu", "softmax", "log_softmax"}
    SUPPORTED_LOSSES,         # {"mse", "bce", "cross_entropy", "huber", "l1", "mae", "nll"}
    SUPPORTED_OPTIMIZERS,     # {"adam", "sgd", "adamw", "rmsprop"}
    SUPPORTED_METRICS,        # {"mae", "mse", "rmse", "accuracy"}
    SUPPORTED_EXPORT_FORMATS, # {"torch", "onnx", "torchscript"}
    SUPPORTED_SCHEDULERS,     # {"step", "cosine", "onecycle"}
    SUPPORTED_PRECISIONS,     # {"fp32", "fp16", "bf16"}
)

kynml/ir/types.py

Tensor type primitives.

class DType(str, Enum):
    FLOAT32 = "float32"
    FLOAT16 = "float16"
    BFLOAT16 = "bfloat16"
    INT64 = "int64"   # class-label targets (cross_entropy / nll)
    BOOL = "bool"

@dataclass(frozen=True)
class TypeShape:
    dims: tuple[int | None, ...]   # None = dynamic (batch axis)
    dtype: DType = DType.FLOAT32

    @property
    def feature_dim(self) -> int | None: ...   # last static dimension

TypeShape instances are immutable; the inference pass produces new instances via dataclasses.replace().
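
A self-contained sketch of the same pattern, showing how a new shape is derived without mutating the old one. The classes below are local re-declarations, the DType enum is trimmed to two members, and the body of feature_dim is an assumed reading of "last static dimension" (the last entry of dims that is not None).

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from enum import Enum
from typing import Optional, Tuple


class DType(str, Enum):
    FLOAT32 = "float32"
    INT64 = "int64"


@dataclass(frozen=True)
class TypeShape:
    dims: Tuple[Optional[int], ...]      # None marks a dynamic axis (batch)
    dtype: DType = DType.FLOAT32

    @property
    def feature_dim(self) -> Optional[int]:
        # Assumption: scan from the right and return the first static size.
        for dim in reversed(self.dims):
            if dim is not None:
                return dim
        return None


x = TypeShape((None, 4))
y = replace(x, dims=(None, 16))          # new instance; x is untouched
print(x.feature_dim, y.feature_dim)
```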

kynml/ir/nodes.py

Frozen IR dataclasses. Backends consume them read-only.

| Node | Key fields |
| --- | --- |
| IRModule | datasets, graphs, train, evaluate, export, inferred, warnings |
| IRGraph | name, ops, input_type, output_type |
| IRTrain | mirrors TrainBlock + n_classes, target_type, seed, deterministic |
| InputOp | size, in_type, out_type |
| LinearOp | out_features, in_features (filled by infer), activation, in_type, out_type |
| DropoutOp | rate, in_type, out_type |
| BatchNorm1dOp | num_features (filled by infer), in_type, out_type |

IRModule.graph(name) and IRModule.dataset(name) are lookup helpers that raise KynMLCodegenError on a miss.

kynml/ir/builder.py

Structural lowering from AST to IR. Pure translation — no validation, no shape inference. Returns an IRModule with inferred=False.

from kynml.ir.builder import lower_program

module = lower_program(program, source_path="model.kyn")
# module.inferred == False; shapes not yet filled

LinearOp.in_features and BatchNorm1dOp.num_features are None at this stage.

kynml/ir/infer.py

Shape and type inference pass. Entry point: infer_module(module) -> IRModule.

The pass threads TypeShape through each graph op sequence:

  • InputOp(size=N) → emits TypeShape((None, N), FLOAT32).
  • LinearOp → reads feature_dim from current shape, fills in_features, advances shape to (None, out_features).
  • DropoutOp → shape-preserving.
  • BatchNorm1dOp — if num_features is None, inherits from current feature_dim; if both are set and disagree, raises KynMLShapeError.
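
The four rules above can be sketched as a single loop that threads the current feature count through the op list. This is an illustration under assumptions: the op classes are trimmed local stand-ins (no in_type/out_type), `infer_ops` is a hypothetical name, `ValueError` stands in for KynMLShapeError, and the list is assumed to start with an InputOp, as validation guarantees.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class InputOp:
    size: int


@dataclass(frozen=True)
class LinearOp:
    out_features: int
    in_features: Optional[int] = None    # filled by inference


@dataclass(frozen=True)
class DropoutOp:
    rate: float


@dataclass(frozen=True)
class BatchNorm1dOp:
    num_features: Optional[int] = None   # filled by inference


def infer_ops(ops):
    """Return (new_ops, output_shape); the input ops are never modified."""
    features = None
    inferred = []
    for op in ops:
        if isinstance(op, InputOp):
            features = op.size
        elif isinstance(op, LinearOp):
            op = replace(op, in_features=features)
            features = op.out_features
        elif isinstance(op, BatchNorm1dOp):
            if op.num_features is None:
                op = replace(op, num_features=features)
            elif op.num_features != features:
                raise ValueError(
                    f"batchnorm expects {op.num_features} features, got {features}"
                )
        # DropoutOp is shape-preserving, so it passes through unchanged.
        inferred.append(op)
    return inferred, (None, features)


ops = [InputOp(4), LinearOp(16), BatchNorm1dOp(), DropoutOp(0.1), LinearOp(3)]
new_ops, out_shape = infer_ops(ops)
print(new_ops[1], new_ops[2], out_shape)
```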

After graph inference, _reconcile_train() enforces loss/output agreement:

| Loss | Requirement | Error type |
| --- | --- | --- |
| bce | final dense has exactly 1 output unit | KynMLShapeError (hard) |
| cross_entropy, nll | final dense has ≥2 output units | KynMLShapeError (hard) |
| cross_entropy + softmax activation | double-applies log-softmax | warning (soft) |
| cross_entropy + log_softmax activation | double log-softmax | warning (soft) |
| mse, l1, mae, huber | any output size | no check |

Warnings accumulate on IRModule.warnings as strings. They are surfaced as severity="warning" diagnostics by the LSP and printed at compile time, but never block compilation.
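
The hard and soft checks can be sketched as one function that raises on a hard violation and returns a list of warning strings otherwise. `reconcile` and its signature are hypothetical, `ShapeError` is a local stand-in for KynMLShapeError, and the message texts are illustrative.

```python
class ShapeError(Exception):
    """Local stand-in for KynMLShapeError."""


def reconcile(loss: str, out_units: int, final_activation: str) -> list:
    """Check loss/output agreement; return soft warnings, raise on hard errors."""
    warnings = []
    if loss == "bce" and out_units != 1:
        raise ShapeError(
            f"loss 'bce' expects a final dense layer with 1 output unit (got {out_units})"
        )
    if loss in ("cross_entropy", "nll") and out_units < 2:
        raise ShapeError(
            f"loss '{loss}' expects a final dense layer with >= 2 output units (got {out_units})"
        )
    if loss == "cross_entropy" and final_activation in ("softmax", "log_softmax"):
        # Soft check: compilation continues, the message is only recorded.
        warnings.append(
            f"cross_entropy already applies log-softmax; final activation "
            f"'{final_activation}' is redundant"
        )
    # mse, l1, mae, huber: any output size is accepted.
    return warnings


print(reconcile("cross_entropy", 3, "softmax"))
```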

All IR mutations are pure: the pass returns new frozen dataclass instances via dataclasses.replace(). The input module is never modified.

kynml/ir/passes.py

Pass registry. run_passes(module) -> IRModule runs the standard IR pass pipeline (currently: infer_module only). New IR optimisation passes plug in here.
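
One common way to structure such a registry is an ordered list of module-to-module functions. The sketch below is an assumption about the general shape, not the contents of passes.py: `PASSES`, `register_pass`, and `run_all` are hypothetical names, and a plain dict stands in for IRModule.

```python
from typing import Callable, List

IRPass = Callable[[dict], dict]
PASSES: List[IRPass] = []


def register_pass(fn: IRPass) -> IRPass:
    """Append a pass to the pipeline; usable as a decorator."""
    PASSES.append(fn)
    return fn


def run_all(module: dict) -> dict:
    """Apply every registered pass in registration order."""
    for ir_pass in PASSES:
        module = ir_pass(module)
    return module


@register_pass
def mark_inferred(module: dict) -> dict:
    return {**module, "inferred": True}      # new dict; input is untouched


@register_pass
def count_ops(module: dict) -> dict:
    return {**module, "n_ops": len(module["ops"])}


before = {"ops": ["input", "linear"], "inferred": False}
after = run_all(before)
print(after)
```

Each pass returns a new value instead of mutating its argument, matching the purity rule stated for the inference pass.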

kynml/pipeline.py

Canonical compilation front door.

from kynml.pipeline import compile_to_ir

module = compile_to_ir(program, source_path="model.kyn", overrides={"lr": 0.01})
# Runs: apply_composition → validate_program → lower_program → run_passes
# Returns IRModule with inferred=True

overrides maps param names to values, equivalent to --param on the CLI.

kynml/codegen/base.py

Backend ABC and context.

@dataclass(frozen=True)
class EmitContext:
    source_path: str | None = None
    project_dir: str | None = None
    extra: dict = field(default_factory=dict)

class Backend:
    name: str = "base"

    def emit(self, module: IRModule, ctx: EmitContext) -> str:
        raise NotImplementedError(...)

def get_backend(name: str = "pytorch") -> Backend: ...

Backends receive IRModule (always inferred=True) and must never inspect AST nodes directly. EmitContext.project_dir is injected so path resolution is deterministic in tests.

kynml/codegen/pytorch_backend.py

PyTorchBackend reads all shape information (in_features, num_features, n_classes, target_type) directly from the IR — it never re-derives dimensions. Its emit() method asserts module.inferred at entry.

Layer rendering dispatches on concrete op types; unknown ops are ignored (forward-compat). Activation dispatch is in _activation_expr(name) -> str | None (returns None for linear, which emits no nn.Module).

Reproducibility additions emitted by the backend (seeding and determinism are opt-in via train block fields):

  • seed → _set_seed(seed) is called in main(), covering random, numpy, torch, and torch.cuda.
  • deterministic = true → additionally sets torch.use_deterministic_algorithms(True) and torch.backends.cudnn.deterministic = True.
  • CONFIG_HASH is computed from the .kyn source bytes at codegen time (sha256 hex, 64 chars).
  • run_manifest.json is always written after training.

kynml/codegen/pytorch.py

Compatibility shim. generate_pytorch(program, source_path) and write_pytorch(program, out_path, source_path) have unchanged signatures but now route through compile_to_ir → PyTorchBackend.emit internally. All existing call sites work without modification.

kynml/sweep.py

Sweep grid expansion.

from kynml.sweep import expand_sweep, generate_sweep_runner

combos = expand_sweep(program)
# Returns list of (combo_dict, resolved_Program), one per Cartesian combination.

runner_src = generate_sweep_runner(combos, script_paths, out_dir)
# Returns a Python orchestrator script that runs all combos sequentially
# and aggregates run_manifest.json files into sweep_results.json.

If the program has no sweep block, expand_sweep returns [({}, resolved)] — a single entry with params substituted.
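
The Cartesian expansion itself can be sketched with itertools.product. `expand_axes` is a hypothetical helper that only produces the combo dicts; pairing each combo with a resolved Program, as expand_sweep does, is left out.

```python
import itertools


def expand_axes(axes: dict) -> list:
    """Return one {name: value} dict per Cartesian combination of the axes."""
    names = list(axes)
    # With no axes, product() yields a single empty tuple, giving [{}].
    return [
        dict(zip(names, combo))
        for combo in itertools.product(*(axes[name] for name in names))
    ]


for combo in expand_axes({"lr": [0.1, 0.01], "hidden": [32, 64]}):
    print(combo)
```

The rightmost axis varies fastest, so two axes of two values each yield four combos in a stable, predictable order.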

kynml/repro/manifest.py

Provenance types and utilities.

from kynml.repro.manifest import config_hash, data_hash, RunManifest, write_manifest

h = config_hash(kyn_source_bytes)   # sha256 hex, 64 chars
h2 = data_hash("data/iris.csv")     # sha256 hex or None if file absent
manifest = RunManifest(config_hash=h, data_hash=h2, env=env_info(), seed=42)
write_manifest("run_manifest.json", manifest)

The JSON layout is stable; future additions are backwards-compatible.
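
The two hash helpers can be approximated with hashlib alone. This is a sketch of the documented behaviour (sha256 hex digest, None for a missing file), not the library's source; the `_sketch` names are hypothetical and the chunked read is a choice of this example.

```python
import hashlib
import os
from typing import Optional


def config_hash_sketch(source: bytes) -> str:
    """sha256 of the raw .kyn bytes as a 64-character hex string."""
    return hashlib.sha256(source).hexdigest()


def data_hash_sketch(path: str) -> Optional[str]:
    """sha256 of a data file, or None when the file does not exist."""
    if not os.path.isfile(path):
        return None
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large datasets are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


print(len(config_hash_sketch(b"model Net:\n")))
```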

kynml/lock.py

Lock file support for drift detection.

from kynml.lock import create_lock, check_lock, LockMismatchError

create_lock(source_text, "kynml.lock", source_path="model.kyn")
check_lock(source_text, "kynml.lock")   # raises LockMismatchError on mismatch

The lock is opt-in; compile_to_ir is not modified. Wire --check-lock in a pre-train hook to enforce it.
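
Drift detection of this kind usually reduces to recording a digest and comparing it later. The sketch below assumes a JSON lock file with a single config_hash key; that layout, the function names, and `LockDrift` (a stand-in for LockMismatchError) are all hypothetical.

```python
import hashlib
import json
import os
import tempfile


class LockDrift(Exception):
    """Local stand-in for LockMismatchError."""


def _digest(source_text: str) -> str:
    return hashlib.sha256(source_text.encode("utf-8")).hexdigest()


def write_lock(source_text: str, lock_path: str) -> None:
    """Record the digest of the current source in the lock file."""
    with open(lock_path, "w", encoding="utf-8") as handle:
        json.dump({"config_hash": _digest(source_text)}, handle)


def verify_lock(source_text: str, lock_path: str) -> None:
    """Raise LockDrift when the source no longer matches the recorded digest."""
    with open(lock_path, encoding="utf-8") as handle:
        recorded = json.load(handle)["config_hash"]
    if recorded != _digest(source_text):
        raise LockDrift(f"{lock_path}: source changed since the lock was created")


with tempfile.TemporaryDirectory() as tmp:
    lock_path = os.path.join(tmp, "kynml.lock")
    write_lock("model Net:\n", lock_path)
    verify_lock("model Net:\n", lock_path)       # unchanged source passes silently
    try:
        verify_lock("model Other:\n", lock_path)
    except LockDrift as exc:
        print("drift detected:", type(exc).__name__)
```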

kynml/format/formatter.py

Idempotent canonical .kyn formatter. Parses the source into the typed AST and re-emits in canonical form. Raises KynMLParseError on invalid input.

Rules: 4-space indentation, exactly one blank line between top-level blocks, no trailing whitespace, single trailing newline, canonical key order per block type.

from kynml.format.formatter import format_source, format_file

canonical = format_source(raw_kyn_text)
canonical = format_file("model.kyn", write=True)  # overwrites in place

Also invocable as python -m kynml.format <file>.

kynml/lsp/diagnostics.py

Pure diagnostics without a pygls dependency. Runs parse → validate → lower → infer and converts all KynMLError subclasses to Diagnostic dicts. Shape-inference warnings become severity="warning".

from kynml.lsp.diagnostics import diagnose

diags = diagnose(source_text)   # never raises
for d in diags:
    print(f"{d['line']}:{d['col']}: [{d['severity']}] {d['message']}")

Each Diagnostic is a TypedDict with keys: line, col, end_line, end_col, severity ("error" | "warning"), message, code ("parse" | "semantic" | "shape" | "warn").
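
A sketch of the Diagnostic shape and of the "never raises" conversion. The TypedDict keys come from the description above; everything else is assumed for illustration: `SpecError` stands in for a KynMLError subclass carrying a line and a code, `collect_diagnostics` is a hypothetical wrapper, and the column values are placeholders.

```python
from typing import Callable, List, TypedDict


class Diagnostic(TypedDict):
    line: int
    col: int
    end_line: int
    end_col: int
    severity: str    # "error" | "warning"
    message: str
    code: str        # "parse" | "semantic" | "shape" | "warn"


class SpecError(Exception):
    """Stand-in error that knows its source line and diagnostic code."""

    def __init__(self, message: str, line: int, code: str):
        super().__init__(message)
        self.line = line
        self.code = code


def _diag(line: int, severity: str, message: str, code: str) -> Diagnostic:
    # Placeholder columns: the whole line is flagged.
    return Diagnostic(line=line, col=0, end_line=line, end_col=0,
                      severity=severity, message=message, code=code)


def collect_diagnostics(check: Callable[[], List[str]]) -> List[Diagnostic]:
    """Run `check`; turn a SpecError or its returned warnings into diagnostics."""
    try:
        warnings = check()
    except SpecError as exc:
        return [_diag(exc.line, "error", str(exc), exc.code)]
    return [_diag(1, "warning", text, "warn") for text in warnings]


def failing_check() -> List[str]:
    raise SpecError("unknown activation 'sigmod'", line=3, code="semantic")


for d in collect_diagnostics(failing_check):
    print(f"{d['line']}:{d['col']}: [{d['severity']}] {d['message']}")
```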

The full LSP server (stdio) is started by kynml lsp and requires pygls (pip install 'kynml[lsp]').

Extension Points

Adding a New Activation

  1. Add the name to SUPPORTED_ACTIVATIONS in kynml/semantic.py.
  2. Add a mapping entry in _activation_expr() in kynml/codegen/pytorch_backend.py. Return None for activations that emit no module (like linear).

# semantic.py
SUPPORTED_ACTIVATIONS = {
    ..., "silu"
}

# codegen/pytorch_backend.py
def _activation_expr(name: str) -> str | None:
    mapping = {
        ...,
        "silu": "nn.SiLU()",
    }

No parser changes needed — dense N silu already parses as DenseLayer(units=N, activation="silu").

Adding a New Loss

  1. Add to SUPPORTED_LOSSES in semantic.py.
  2. Add to the mapping in _render_loss() in codegen/pytorch_backend.py.
  3. If the loss has shape constraints (like bce requires 1 output unit, cross_entropy requires ≥2), add a branch in _reconcile_train() in ir/infer.py.

# semantic.py
SUPPORTED_LOSSES = {..., "focal"}

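# codegen/pytorch_backend.py, inside the mapping in _render_loss
# Note: torch.nn has no FocalLoss class, so the emitted script would need its own definition.
"focal": "FocalLoss()"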

Adding a New Optimizer

  1. Add to SUPPORTED_OPTIMIZERS in semantic.py.
  2. Add a branch in _render_optimizer() in codegen/pytorch_backend.py.

Adding a New Scheduler

  1. Add to SUPPORTED_SCHEDULERS in semantic.py.
  2. Add a branch in _render_scheduler_build() in codegen/pytorch_backend.py.

Adding a New Layer Op

Parser + AST + semantic (frontend):

  1. Define a frozen dataclass in ast_nodes.py (e.g. Conv2dLayer).
  2. Update the Layer type alias.
  3. Add parsing logic in _parse_model() in parser.py.
  4. Add validation in the model loop in semantic.py.

IR (lowering + inference):

  1. Define an IROp subclass in ir/nodes.py (e.g. Conv2dOp).
  2. Add a lowering branch in _lower_model() in ir/builder.py.
  3. Add shape propagation logic in _infer_op() in ir/infer.py.

Backend (codegen):

  1. Add rendering logic in PyTorchBackend._render_layers() in codegen/pytorch_backend.py.

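Adding a layer op is the only extension that touches every compiler stage: parser, AST, semantic validation, IR lowering and inference, and the backend. Keep each change minimal and targeted.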

Adding a New Backend

  1. Implement Backend from kynml/codegen/base.py:

from kynml.codegen.base import Backend, EmitContext
from kynml.ir.nodes import IRModule

class JAXBackend(Backend):
    name = "jax"

    def emit(self, module: IRModule, ctx: EmitContext) -> str:
        assert module.inferred
        # Consume module.graphs, module.train, etc.
        ...

  2. Add a branch in get_backend() in kynml/codegen/base.py:

if name == "jax":
    from kynml.codegen.jax_backend import JAXBackend
    return JAXBackend()

The backend receives a fully inferred IRModule and must never access AST nodes. No changes to the parser, semantic validator, or IR passes are required.

Adding a New Export Format

  1. Add to SUPPORTED_EXPORT_FORMATS in semantic.py.
  2. Add a branch in _render_export() in codegen/pytorch_backend.py.

Error Types

from kynml.errors import (
    KynMLError,          # base class
    KynMLParseError,     # from parser.py
    KynMLSemanticError,  # from semantic.py and compose.py
    KynMLShapeError,     # from ir/infer.py — dimension mismatches, loss/output conflicts
    KynMLCodegenError,   # from codegen/
)

All errors are subclasses of KynMLError so callers can catch the family with one except.

KynMLShapeError is a subclass of KynMLSemanticError. Shape errors are raised by the IR inference pass (post-AST) and include source-located messages like examples/model.kyn [train]: loss 'bce' expects a final dense layer with 1 output unit (got 3).
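
Because KynMLShapeError derives from KynMLSemanticError, the order of except clauses matters when the two need different handling. The classes below are local re-declarations that mirror the documented hierarchy so the example runs on its own; `classify` is a hypothetical helper.

```python
class KynMLError(Exception): ...
class KynMLParseError(KynMLError): ...
class KynMLSemanticError(KynMLError): ...
class KynMLShapeError(KynMLSemanticError): ...
class KynMLCodegenError(KynMLError): ...


def classify(exc: Exception) -> str:
    """Name the most specific documented category an error belongs to."""
    try:
        raise exc
    except KynMLShapeError:
        # Must come before KynMLSemanticError, which would also match.
        return "shape"
    except KynMLSemanticError:
        return "semantic"
    except KynMLError:
        return "other kynml error"


print(classify(KynMLShapeError("loss 'bce' expects 1 output unit (got 3)")))
print(classify(KynMLSemanticError("unknown optimizer")))
print(classify(KynMLParseError("unexpected indent")))
```

Code that only catches KynMLSemanticError therefore also receives shape errors, which is convenient for callers that treat both as validation failures.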

See Also