Compiler Internals

Per-module responsibilities and concrete extension points for adding new layers, optimizers, losses, activations, backends, and export formats.

Pipeline Stages

source → parse → AST → compose → validate → lower(IR) → infer(shapes) → Backend.emit

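Each stage is handled by a distinct module with a single responsibility. The canonical entry point is compile_to_ir() in kynml/pipeline.py: it takes an already-parsed program and chains the compose, validate, lower, and infer stages. Parsing happens before it and Backend.emit after it.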

Module Responsibilities

`kynml/parser.py`

Handwritten, indentation-aware LL(1) parser. No parser generator dependency.

Top-level blocks: import "path", params:, sweep:, dataset <Name>:, model <Name>:, train:, evaluate:, export:
Block bodies must be indented exactly 4 spaces; top-level lines start at column 0.
Comments: lines whose first non-whitespace character is # are stripped before parsing.
Values: strings ("..."), integers, floats, booleans (true/false), lists ([a, b]), function calls (name(k=v, ...)), bare identifiers, and $name ParamRef tokens.
Function call arguments support nesting: scheduler=onecycle(max_lr=0.01).
Sweep axes require list values — lr = 0.001 inside a sweep: block is a parse error.
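
The layout rules above (comment stripping, column-0 top-level lines, 4-space block bodies) can be sketched as a line pre-pass. This is a minimal illustration, not the parser's actual code: `split_lines` is a hypothetical helper, `ValueError` stands in for KynMLParseError, and dropping blank lines is an assumption.

```python
def split_lines(text: str, filename: str = "<text>") -> list:
    """Return (line_number, indent, content) for each significant line."""
    result = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        content = raw.strip()
        # Blank lines (assumed) and lines whose first non-whitespace
        # character is '#' are dropped before parsing.
        if not content or content.startswith("#"):
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        # Top-level lines sit at column 0; block bodies use exactly 4 spaces.
        if indent not in (0, 4):
            raise ValueError(
                f"{filename}:{lineno}: expected indent of 0 or 4 spaces, got {indent}"
            )
        result.append((lineno, indent, content))
    return result


sample = "# demo\nmodel Net:\n    dense 8 relu\n"
lines = split_lines(sample)
```

Line numbers are kept from the original text, so an error raised later can still use the file:line: message format.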

Key public API:

from kynml.parser import parse_file, parse_text

program = parse_file("model.kyn")       # reads file from disk
program = parse_text(spec_str)          # parses in-memory string

Errors raise KynMLParseError with file:line: message format.

`kynml/ast_nodes.py`

All AST nodes are frozen dataclasses — safe to hash, share, and inspect.

Node	Fields
`Program`	`datasets`, `models`, `train`, `evaluate`, `export`, `params`, `sweep`, `imports`
`ParamsBlock`	`values: dict[str, Any]`
`SweepBlock`	`axes: dict[str, list]`
`ParamRef`	`name: str` — sentinel for `$name` tokens before substitution
`DatasetBlock`	`name`, `source`, `target`, `split`, `normalize`, `shuffle`, `num_workers`, `pin_memory`, `prefetch`
`ModelBlock`	`name`, `layers`
`TrainBlock`	`model`, `data`, `loss`, `optimizer`, `epochs`, `batch`, `device`, `scheduler`, `early_stop`, `checkpoint`, `precision`, `compile`, `seed`, `deterministic`
`EvaluateBlock`	`metrics`
`ExportBlock`	`format`, `path`, `input_shape`, `opset`
`InputLayer`	`size`
`DenseLayer`	`units`, `activation`
`DropoutLayer`	`rate`
`BatchNorm1dLayer`	`features` (optional — inferred by the IR pass)
`FunctionCall`	`name`, `args`, `kwargs`

FunctionCall represents structured function values: adam(lr=0.001) parses to FunctionCall(name="adam", args=[], kwargs={"lr": 0.001}).

A ParamRef in an AST field (e.g. DenseLayer(units=ParamRef("hidden"))) is a sentinel that the composition pass replaces with a concrete value before validation. Programs that pass through compile_to_ir() always have ParamRef sentinels fully resolved before validate_program() runs.

`kynml/compose.py`

Composition pre-passes. Applied before semantic validation.

Pass 1 — resolve_imports(program, source_path)

Merges dataset and model blocks from imported .kyn files into the host program. Import resolution is recursive with cycle detection. Only dataset and model blocks are imported; train, evaluate, export, params, and sweep belong exclusively to the top-level program. Duplicate names between an imported file and the host raise KynMLSemanticError.

from kynml.compose import resolve_imports
program = resolve_imports(program, source_path="main.kyn")
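
Recursive resolution with cycle detection can be sketched over an in-memory import table. This is only an illustration of the technique: `resolve`, the `files` mapping, and the use of `ValueError` instead of KynMLSemanticError are assumptions, and the real pass merges blocks rather than returning file names.

```python
def resolve(path: str, files: dict, stack: tuple = ()) -> list:
    """Return files in dependency order (imports first); raise on a cycle."""
    if path in stack:
        # The file is already being resolved further up the call chain.
        chain = " -> ".join(stack + (path,))
        raise ValueError(f"import cycle: {chain}")
    order = []
    for dep in files.get(path, []):
        for item in resolve(dep, files, stack + (path,)):
            if item not in order:  # a file shared by two imports is listed once
                order.append(item)
    order.append(path)
    return order


# Hypothetical project: main imports data and models; models also imports data.
files = {
    "main.kyn": ["data.kyn", "models.kyn"],
    "models.kyn": ["data.kyn"],
    "data.kyn": [],
}
order = resolve("main.kyn", files)
```

Tracking the active import chain in `stack`, rather than a global visited set, lets a diamond-shaped import graph resolve cleanly while a true cycle is still rejected.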

Pass 2 — substitute_params(program, overrides=None)

Replaces all ParamRef sentinels with concrete values. Resolution order: overrides dict (from CLI --param) takes priority over params block defaults. A ParamRef name that appears in neither source raises KynMLSemanticError.

from kynml.compose import substitute_params
resolved = substitute_params(program, overrides={"hidden": 128})
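
The resolution order can be illustrated on a single frozen node. The sketch below uses stand-in `ParamRef` and `DenseLayer` classes and a hypothetical `substitute` helper; `KeyError` stands in for KynMLSemanticError, and the real pass walks the whole program rather than one node.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Optional


@dataclass(frozen=True)
class ParamRef:
    name: str


@dataclass(frozen=True)
class DenseLayer:
    units: Any
    activation: str = "linear"


def substitute(node, params: dict, overrides: Optional[dict] = None):
    """Return a copy of node with every ParamRef field replaced by its value."""
    merged = {**params, **(overrides or {})}  # overrides win over block defaults
    changes = {}
    for f in fields(node):
        value = getattr(node, f.name)
        if isinstance(value, ParamRef):
            if value.name not in merged:
                raise KeyError(f"unknown param ${value.name}")
            changes[f.name] = merged[value.name]
    # Frozen dataclasses cannot be mutated, so build a new instance.
    return replace(node, **changes)


layer = DenseLayer(units=ParamRef("hidden"), activation="relu")
resolved_layer = substitute(layer, params={"hidden": 64}, overrides={"hidden": 128})
```

The original `layer` still holds its ParamRef afterwards, which is what makes it safe to resolve the same program several times with different overrides.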

Combined — apply_composition(program, source_path, overrides)

The canonical entry point for composition; runs both passes in order. Programs with no import/params/sweep features pass through unchanged.

`kynml/semantic.py`

Pure validation — no code emission, no I/O. Raises KynMLSemanticError on violation.

Validates:

At least one dataset, one model, and a train block exist.
dataset.source must be csv("path") with exactly one positional argument.
Dataset split in (0, 1), num_workers >= 0.
Model starts with exactly one InputLayer, has at least one DenseLayer, no DenseLayer before InputLayer.
Layer units/size positive, dropout rate in [0, 1).
Activation in SUPPORTED_ACTIVATIONS, loss in SUPPORTED_LOSSES, optimizer in SUPPORTED_OPTIMIZERS.
train.model and train.data reference names defined in the same program.
epochs > 0, batch > 0.
device in {"auto", "cpu", "cuda"}, precision in {"fp32", "fp16", "bf16"}.
Scheduler in {"step", "cosine", "onecycle"}.
early_stop(patience=N) — patience > 0 if given.
checkpoint(every_n=N) — every_n > 0 if given.
Metrics in SUPPORTED_METRICS.
Export format in {"torch", "onnx", "torchscript"}; ONNX requires input_shape.

Typo suggestions use difflib.get_close_matches on unknown names.
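
A minimal sketch of how such a suggestion could be built with difflib.get_close_matches. The `unknown_name_message` helper and the message wording are assumptions for illustration; only the difflib call and the optimizer set come from this document.

```python
import difflib

SUPPORTED_OPTIMIZERS = {"adam", "sgd", "adamw", "rmsprop"}


def unknown_name_message(kind: str, name: str, supported: set) -> str:
    """Build an error message, appending the closest supported name if any."""
    # sorted() makes the candidate order, and so the result, deterministic.
    hints = difflib.get_close_matches(name, sorted(supported), n=1)
    suffix = f" (did you mean '{hints[0]}'?)" if hints else ""
    return f"unknown {kind} '{name}'{suffix}"


msg = unknown_name_message("optimizer", "adm", SUPPORTED_OPTIMIZERS)
no_hint = unknown_name_message("optimizer", "zzz", SUPPORTED_OPTIMIZERS)
```

get_close_matches uses a default similarity cutoff of 0.6, so names with no plausible match produce a message without a suggestion.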

Constants you can import for extension work:

from kynml.semantic import (
    SUPPORTED_ACTIVATIONS,    # {"relu", "sigmoid", "tanh", "linear", "leaky_relu", "gelu", "softmax", "log_softmax"}
    SUPPORTED_LOSSES,         # {"mse", "bce", "cross_entropy", "huber", "l1", "mae", "nll"}
    SUPPORTED_OPTIMIZERS,     # {"adam", "sgd", "adamw", "rmsprop"}
    SUPPORTED_METRICS,        # {"mae", "mse", "rmse", "accuracy"}
    SUPPORTED_EXPORT_FORMATS, # {"torch", "onnx", "torchscript"}
    SUPPORTED_SCHEDULERS,     # {"step", "cosine", "onecycle"}
    SUPPORTED_PRECISIONS,     # {"fp32", "fp16", "bf16"}
)

`kynml/ir/types.py`

Tensor type primitives.

class DType(str, Enum):
    FLOAT32 = "float32"
    FLOAT16 = "float16"
    BFLOAT16 = "bfloat16"
    INT64 = "int64"   # class-label targets (cross_entropy / nll)
    BOOL = "bool"

@dataclass(frozen=True)
class TypeShape:
    dims: tuple[int | None, ...]   # None = dynamic (batch axis)
    dtype: DType = DType.FLOAT32

    @property
    def feature_dim(self) -> int | None: ...   # last static dimension

TypeShape instances are immutable; the inference pass produces new instances via dataclasses.replace().
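
The immutability contract can be seen in a small stand-alone sketch. The classes below are trimmed copies for illustration, and the body of `feature_dim` is an assumed reading of "last static dimension" (the last axis that is not None), since the original elides it.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class DType(str, Enum):
    FLOAT32 = "float32"
    INT64 = "int64"


@dataclass(frozen=True)
class TypeShape:
    dims: tuple
    dtype: DType = DType.FLOAT32

    @property
    def feature_dim(self) -> Optional[int]:
        # Assumption: scan from the right for the first statically known axis.
        for dim in reversed(self.dims):
            if dim is not None:
                return dim
        return None


before = TypeShape(dims=(None, 4))
after = replace(before, dims=(None, 16))  # new instance; `before` is untouched
```

Assigning to `before.dims` directly raises dataclasses.FrozenInstanceError, so every shape change has to go through replace().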

`kynml/ir/nodes.py`

Frozen IR dataclasses. Backends only read these nodes and never mutate them.

Node	Key fields
`IRModule`	`datasets`, `graphs`, `train`, `evaluate`, `export`, `inferred`, `warnings`
`IRGraph`	`name`, `ops`, `input_type`, `output_type`
`IRTrain`	mirrors `TrainBlock` + `n_classes`, `target_type`, `seed`, `deterministic`
`InputOp`	`size`, `in_type`, `out_type`
`LinearOp`	`out_features`, `in_features` (filled by infer), `activation`, `in_type`, `out_type`
`DropoutOp`	`rate`, `in_type`, `out_type`
`BatchNorm1dOp`	`num_features` (filled by infer), `in_type`, `out_type`

IRModule.graph(name) and IRModule.dataset(name) are lookup helpers that raise KynMLCodegenError on a miss.

`kynml/ir/builder.py`

Structural lowering from AST to IR. Pure translation — no validation, no shape inference. Returns an IRModule with inferred=False.

from kynml.ir.builder import lower_program

module = lower_program(program, source_path="model.kyn")
# module.inferred == False; shapes not yet filled

LinearOp.in_features and BatchNorm1dOp.num_features are None at this stage.

`kynml/ir/infer.py`

Shape and type inference pass. Entry point: infer_module(module) -> IRModule.

The pass threads TypeShape through each graph op sequence:

InputOp(size=N) → emits TypeShape((None, N), FLOAT32).
LinearOp → reads feature_dim from current shape, fills in_features, advances shape to (None, out_features).
DropoutOp → shape-preserving.
BatchNorm1dOp → if num_features is None, inherits it from the current feature_dim; if num_features is set and disagrees with feature_dim, raises KynMLShapeError.
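
The per-op rules above can be sketched as a single loop. This is a simplified illustration under stated assumptions: the op classes are trimmed stand-ins, the running shape is tracked as a plain feature count instead of a TypeShape, `infer_ops` is a hypothetical name, and `ValueError` stands in for KynMLShapeError.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class InputOp:
    size: int


@dataclass(frozen=True)
class LinearOp:
    out_features: int
    in_features: Optional[int] = None


@dataclass(frozen=True)
class DropoutOp:
    rate: float


@dataclass(frozen=True)
class BatchNorm1dOp:
    num_features: Optional[int] = None


def infer_ops(ops: list) -> list:
    """Return new ops with in_features / num_features filled in."""
    feature_dim = None
    out = []
    for op in ops:
        if isinstance(op, InputOp):
            feature_dim = op.size
        elif isinstance(op, LinearOp):
            op = replace(op, in_features=feature_dim)
            feature_dim = op.out_features
        elif isinstance(op, BatchNorm1dOp):
            if op.num_features is None:
                op = replace(op, num_features=feature_dim)
            elif op.num_features != feature_dim:
                raise ValueError(
                    f"batchnorm expects {op.num_features} features, got {feature_dim}"
                )
        # DropoutOp is shape-preserving, so feature_dim is left unchanged.
        out.append(op)
    return out


ops = [InputOp(4), LinearOp(16), BatchNorm1dOp(), DropoutOp(0.1), LinearOp(3)]
inferred = infer_ops(ops)
```

The input list is left as it was: ops[1].in_features is still None after the call, while inferred[1] carries the filled-in value.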

After graph inference, _reconcile_train() enforces loss/output agreement:

Loss	Requirement	Error type
`bce`	final dense has exactly 1 output unit	`KynMLShapeError` (hard)
`cross_entropy`, `nll`	final dense has ≥2 output units	`KynMLShapeError` (hard)
`cross_entropy` + `softmax` activation	softmax output is fed into the loss's own log-softmax	`warning` (soft)
`cross_entropy` + `log_softmax` activation	log-softmax is applied twice	`warning` (soft)
`mse`, `l1`, `mae`, `huber`	any output size	no check

Warnings accumulate on IRModule.warnings as strings. They are surfaced as severity="warning" diagnostics by the LSP and printed at compile time, but never block compilation.
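
The split between hard errors and soft warnings can be sketched as follows. `reconcile` is a hypothetical stand-in for _reconcile_train() that takes plain values instead of an IRModule, `ValueError` stands in for KynMLShapeError, and the warning wording is invented for illustration.

```python
def reconcile(loss: str, out_units: int, activation: str) -> list:
    """Raise on hard loss/output conflicts; return soft warnings as strings."""
    warnings = []
    if loss == "bce" and out_units != 1:
        raise ValueError(
            f"loss 'bce' expects a final dense layer with 1 output unit (got {out_units})"
        )
    if loss in ("cross_entropy", "nll") and out_units < 2:
        raise ValueError(
            f"loss '{loss}' expects a final dense layer with >= 2 output units (got {out_units})"
        )
    if loss == "cross_entropy" and activation in ("softmax", "log_softmax"):
        # Soft issue: compilation continues, the message is only reported.
        warnings.append(
            f"loss 'cross_entropy' already applies log-softmax; drop the '{activation}' activation"
        )
    # mse, l1, mae, and huber accept any output size, so nothing to check.
    return warnings


soft = reconcile("cross_entropy", 3, "softmax")
clean = reconcile("mse", 7, "linear")
```

Returning warnings as data rather than raising keeps the two severities on separate paths, which is what allows warnings to surface in the LSP without blocking compilation.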

All IR mutations are pure: the pass returns new frozen dataclass instances via dataclasses.replace(). The input module is never modified.

`kynml/ir/passes.py`

Pass registry. run_passes(module) -> IRModule runs the standard IR pass pipeline (currently: infer_module only). New IR optimisation passes plug in here.
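
One plausible shape for such a registry is an ordered list of module-to-module functions. The sketch below is an assumption about the design, not the actual contents of passes.py: `PASSES` is a hypothetical name and a dict stands in for IRModule.

```python
from typing import Callable

Module = dict  # stand-in for IRModule in this sketch


def infer_module(module: Module) -> Module:
    # Pure pass: returns a new value instead of mutating its input.
    return {**module, "inferred": True}


# Hypothetical registry: passes run in list order, each feeding the next.
PASSES: list = [infer_module]


def run_passes(module: Module) -> Module:
    for ir_pass in PASSES:
        module = ir_pass(module)
    return module


original = {"inferred": False}
result = run_passes(original)
```

With this layout a new optimisation pass is one function plus one list entry, appended after infer_module so that it sees inferred shapes.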

`kynml/pipeline.py`

Canonical compilation front door.

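from kynml.parser import parse_file
from kynml.pipeline import compile_to_ir

program = parse_file("model.kyn")   # compile_to_ir takes an already-parsed Program
module = compile_to_ir(program, source_path="model.kyn", overrides={"lr": 0.01})
# Runs: apply_composition → validate_program → lower_program → run_passes
# Returns IRModule with inferred=True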

overrides maps param names to values, equivalent to --param on the CLI.

`kynml/codegen/base.py`

Backend ABC and context.

@dataclass(frozen=True)
class EmitContext:
    source_path: str | None = None
    project_dir: str | None = None
    extra: dict = field(default_factory=dict)

class Backend:
    name: str = "base"

    def emit(self, module: IRModule, ctx: EmitContext) -> str:
        raise NotImplementedError(...)

def get_backend(name: str = "pytorch") -> Backend: ...

Backends receive IRModule (always inferred=True) and must never inspect AST nodes directly. EmitContext.project_dir is injected so path resolution is deterministic in tests.

`kynml/codegen/pytorch_backend.py`

PyTorchBackend reads all shape information (in_features, num_features, n_classes, target_type) directly from the IR — it never re-derives dimensions. Its emit() method asserts module.inferred at entry.

Layer rendering dispatches on concrete op types; unknown ops are ignored (forward-compat). Activation dispatch is in _activation_expr(name) -> str | None (returns None for linear, which emits no nn.Module).

Reproducibility additions emitted by the backend (seed and deterministic are opt-in train block fields; run_manifest.json is unconditional):
- seed → _set_seed(seed) in main() covering random, numpy, torch, and torch.cuda.
- deterministic = true → additionally sets torch.use_deterministic_algorithms(True) and torch.backends.cudnn.deterministic = True.
- CONFIG_HASH is computed from the .kyn source bytes at codegen time (sha256 hex, 64 chars).
- run_manifest.json is always written after training.

`kynml/codegen/pytorch.py`

Compatibility shim. generate_pytorch(program, source_path) and write_pytorch(program, out_path, source_path) have unchanged signatures but now route through compile_to_ir → PyTorchBackend.emit internally. All existing call sites work without modification.

`kynml/sweep.py`

Sweep grid expansion.

from kynml.sweep import expand_sweep, generate_sweep_runner

combos = expand_sweep(program)
# Returns list of (combo_dict, resolved_Program), one per Cartesian combination.

runner_src = generate_sweep_runner(combos, script_paths, out_dir)
# Returns a Python orchestrator script that runs all combos sequentially
# and aggregates run_manifest.json files into sweep_results.json.

If the program has no sweep block, expand_sweep returns [({}, resolved)] — a single entry with params substituted.
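
The Cartesian expansion itself can be sketched with itertools.product. `expand_axes` is a hypothetical helper that works on a plain axes dict; the real expand_sweep also substitutes each combination into the program.

```python
import itertools


def expand_axes(axes: dict) -> list:
    """Return one dict per Cartesian combination of the sweep axes."""
    if not axes:
        return [{}]  # no sweep block: a single, empty combination
    names = list(axes)
    value_lists = [axes[name] for name in names]
    return [dict(zip(names, values)) for values in itertools.product(*value_lists)]


combos = expand_axes({"lr": [0.001, 0.01], "hidden": [64, 128]})
single = expand_axes({})
```

The number of combinations is the product of the axis lengths, so two axes with two values each yield four runs, with the last axis varying fastest.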

`kynml/repro/manifest.py`

Provenance types and utilities.

from pathlib import Path

# env_info is used below; importing it from this module is an assumption.
from kynml.repro.manifest import config_hash, data_hash, env_info, RunManifest, write_manifest

kyn_source_bytes = Path("model.kyn").read_bytes()
h = config_hash(kyn_source_bytes)   # sha256 hex, 64 chars
h2 = data_hash("data/iris.csv")     # sha256 hex or None if file absent
manifest = RunManifest(config_hash=h, data_hash=h2, env=env_info(), seed=42)
write_manifest("run_manifest.json", manifest)

The JSON layout is stable; future additions are backwards-compatible.
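
The two hashes described above can be reproduced with hashlib alone. The functions below are stand-ins that mirror the documented behaviour (sha256 hex digest, None for a missing file); the chunked read is an implementation choice of this sketch, not a statement about the library.

```python
import hashlib
from pathlib import Path
from typing import Optional


def config_hash(source: bytes) -> str:
    """sha256 of the raw .kyn bytes as a 64-character hex string."""
    return hashlib.sha256(source).hexdigest()


def data_hash(path: str) -> Optional[str]:
    """sha256 of a data file, or None if the file does not exist."""
    file_path = Path(path)
    if not file_path.is_file():
        return None
    digest = hashlib.sha256()
    with file_path.open("rb") as handle:
        # Read in 1 MiB chunks so large datasets are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


h = config_hash(b"model Net:\n    dense 8 relu\n")
missing = data_hash("no/such/file.csv")
```

Because the hash covers raw bytes, any edit to the source, including whitespace or comments, produces a different config hash.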

`kynml/lock.py`

Lock file support for drift detection.

from kynml.lock import create_lock, check_lock, LockMismatchError

create_lock(source_text, "kynml.lock", source_path="model.kyn")
check_lock(source_text, "kynml.lock")   # raises LockMismatchError on mismatch

The lock is opt-in: compile_to_ir never reads or checks it. Wire --check-lock into a pre-train hook to enforce it.

`kynml/format/formatter.py`

Idempotent canonical .kyn formatter. Parses the source into the typed AST and re-emits in canonical form. Raises KynMLParseError on invalid input.

Rules: 4-space indentation, exactly one blank line between top-level blocks, no trailing whitespace, single trailing newline, canonical key order per block type.

from kynml.format.formatter import format_source, format_file

canonical = format_source(raw_kyn_text)
canonical = format_file("model.kyn", write=True)  # overwrites in place

Also invocable as python -m kynml.format <file>.

`kynml/lsp/diagnostics.py`

Pure diagnostics without a pygls dependency. Runs parse → validate → lower → infer and converts all KynMLError subclasses to Diagnostic dicts. Shape-inference warnings become severity="warning".

from kynml.lsp.diagnostics import diagnose

diags = diagnose(source_text)   # never raises
for d in diags:
    print(f"{d['line']}:{d['col']}: [{d['severity']}] {d['message']}")

Each Diagnostic is a TypedDict with keys: line, col, end_line, end_col, severity ("error" | "warning"), message, code ("parse" | "semantic" | "shape" | "warn").
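
A sketch of the exception-to-diagnostic conversion, using the keys listed above. Everything other than those keys is assumed for illustration: `SketchError` stands in for a KynMLError subclass, `diagnose_with` and `reject_empty` are hypothetical, and the 1-based line and column values are a guess.

```python
from typing import TypedDict


class Diagnostic(TypedDict):
    line: int
    col: int
    end_line: int
    end_col: int
    severity: str
    message: str
    code: str


class SketchError(Exception):
    """Stand-in for a KynMLError subclass carrying a line number and a code."""

    def __init__(self, message: str, line: int = 1, code: str = "semantic"):
        super().__init__(message)
        self.line = line
        self.code = code


def diagnose_with(check, source: str) -> list:
    """Run check(source); turn a SketchError into a diagnostic instead of raising."""
    try:
        check(source)
    except SketchError as exc:
        return [Diagnostic(line=exc.line, col=1, end_line=exc.line, end_col=1,
                           severity="error", message=str(exc), code=exc.code)]
    return []


def reject_empty(source: str) -> None:
    if not source.strip():
        raise SketchError("empty program", line=1, code="parse")


diags = diagnose_with(reject_empty, "")
ok = diagnose_with(reject_empty, "model Net:")
```

Catching at the boundary and returning plain dicts is what lets an editor integration call the function on every keystroke without its own error handling.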

The full LSP server (stdio) is started by kynml lsp and requires pygls (pip install 'kynml[lsp]').

Extension Points

Adding a New Activation

Add the name to SUPPORTED_ACTIVATIONS in kynml/semantic.py.
Add a mapping entry in _activation_expr() in kynml/codegen/pytorch_backend.py. Return None for activations that emit no module (like linear).

# semantic.py
SUPPORTED_ACTIVATIONS = {
    ..., "silu"
}

# codegen/pytorch_backend.py
def _activation_expr(name: str) -> str | None:
    mapping = {
        ...,
        "silu": "nn.SiLU()",
    }

No parser changes needed — dense N silu already parses as DenseLayer(units=N, activation="silu").

Adding a New Loss

Add to SUPPORTED_LOSSES in semantic.py.
Add to the mapping in _render_loss() in codegen/pytorch_backend.py.
If the loss has shape constraints (like bce requires 1 output unit, cross_entropy requires ≥2), add a branch in _reconcile_train() in ir/infer.py.

# semantic.py
SUPPORTED_LOSSES = {..., "focal"}

# codegen/pytorch_backend.py — _render_loss
"focal": "FocalLoss()",   # torch.nn has no FocalLoss; the emitted script must define or import one

Adding a New Optimizer

Add to SUPPORTED_OPTIMIZERS in semantic.py.
Add a branch in _render_optimizer() in codegen/pytorch_backend.py.

Adding a New Scheduler

Add to SUPPORTED_SCHEDULERS in semantic.py.
Add a branch in _render_scheduler_build() in codegen/pytorch_backend.py.

Adding a New Layer Op

Parser + AST + semantic (frontend):

Define a frozen dataclass in ast_nodes.py (e.g. Conv2dLayer).
Update the Layer type alias.
Add parsing logic in _parse_model() in parser.py.
Add validation in the model loop in semantic.py.

IR (lowering + inference):

Define an IROp subclass in ir/nodes.py (e.g. Conv2dOp).
Add a lowering branch in _lower_model() in ir/builder.py.
Add shape propagation logic in _infer_op() in ir/infer.py.

Backend (codegen):

Add rendering logic in PyTorchBackend._render_layers() in codegen/pytorch_backend.py.

Adding a layer op is the only extension that touches every compiler stage: parser, AST, semantic validation, IR, and backend. Keep the change in each stage minimal and targeted.

Adding a New Backend

Implement Backend from kynml/codegen/base.py:

from kynml.codegen.base import Backend, EmitContext
from kynml.ir.nodes import IRModule

class JAXBackend(Backend):
    name = "jax"

    def emit(self, module: IRModule, ctx: EmitContext) -> str:
        assert module.inferred
        # Consume module.graphs, module.train, etc.
        ...

Add a branch in get_backend() in kynml/codegen/base.py:

if name == "jax":
    from kynml.codegen.jax_backend import JAXBackend
    return JAXBackend()

The backend receives a fully inferred IRModule and must never access AST nodes. No changes to the parser, semantic validator, or IR passes are required.

Adding a New Export Format

Add to SUPPORTED_EXPORT_FORMATS in semantic.py.
Add a branch in _render_export() in codegen/pytorch_backend.py.

Error Types

from kynml.errors import (
    KynMLError,          # base class
    KynMLParseError,     # from parser.py
    KynMLSemanticError,  # from semantic.py and compose.py
    KynMLShapeError,     # from ir/infer.py — dimension mismatches, loss/output conflicts
    KynMLCodegenError,   # from codegen/
)

All errors are subclasses of KynMLError so callers can catch the family with one except.

KynMLShapeError is a subclass of KynMLSemanticError. Shape errors are raised by the IR inference pass (post-AST) and include source-located messages like examples/model.kyn [train]: loss 'bce' expects a final dense layer with 1 output unit (got 3).
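
Because KynMLShapeError derives from KynMLSemanticError, the order of except clauses matters for callers that want to tell them apart. The sketch below redefines the hierarchy locally so it runs on its own; `classify` is a hypothetical helper.

```python
class KynMLError(Exception): ...
class KynMLParseError(KynMLError): ...
class KynMLSemanticError(KynMLError): ...
class KynMLShapeError(KynMLSemanticError): ...
class KynMLCodegenError(KynMLError): ...


def classify(exc: Exception) -> str:
    """Map an exception to a label, most specific class first."""
    try:
        raise exc
    except KynMLShapeError:
        return "shape"       # must precede its parent class below
    except KynMLSemanticError:
        return "semantic"
    except KynMLError:
        return "other"       # parse and codegen errors land here


shape_label = classify(KynMLShapeError("loss 'bce' expects 1 output unit"))
semantic_label = classify(KynMLSemanticError("unknown optimizer"))
```

If the two clauses were swapped, shape errors would be reported as "semantic", since the first matching except clause wins.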
