Tutorial: Regression — Housing Price Prediction

End-to-end walkthrough: train a feedforward network to predict a continuous target from tabular data.


What you will build

A model that reads a CSV of housing features, trains with MSE loss, evaluates with MAE and RMSE, and exports a portable .pt state dict. The full pipeline runs with a single kynml train command.


Prerequisites

pip install kynml          # core install — no extras needed for CSV + torch export

Requires Python 3.11 or later. PyTorch is a hard dependency and is installed automatically.


1. Prepare your data

KynML's CSV connector reads any delimiter-separated file with a header row. Categorical columns are one-hot encoded automatically via pd.get_dummies; numeric columns pass through.

The bundled example dataset lives at data/housing.csv (10 numeric features, target column price):

f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,price
1200,3,2,15,0.20,8,1,0,5,2,240000
1500,3,2,10,0.25,7,1,1,6,2,285000
...
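
The encoding step described above decides how many input features the model sees. The sketch below uses pandas directly, not the KynML connector, to show how a single categorical column widens the feature matrix. The neighborhood column and its values are hypothetical and are not part of the bundled dataset.

```python
import pandas as pd

# Two numeric columns plus one hypothetical categorical column.
features = pd.DataFrame(
    {
        "f1": [1200, 1500, 1350],
        "f2": [3, 3, 2],
        "neighborhood": ["north", "south", "east"],  # hypothetical column
    }
)

# Numeric columns pass through; each category becomes its own indicator column.
encoded = pd.get_dummies(features, drop_first=False)
x = encoded.astype("float32").to_numpy()

print(list(encoded.columns))
print("feature count after encoding:", x.shape[1])
```

Here three source columns become five encoded columns, so a spec for this frame would need input 5 rather than input 3.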

2. Write the spec

Save as house_price.kyn:

dataset HouseData:
    source = csv("data/housing.csv")
    target = "price"
    split = 0.8
    normalize = true

model HousePriceModel:
    input 10
    dense 64 relu
    dense 32 relu
    dense 1 linear

train:
    model = HousePriceModel
    data = HouseData
    loss = mse
    optimizer = adam(lr=0.001)
    epochs = 20
    batch = 32
    device = auto

evaluate:
    metrics = [mae, rmse]

export:
    format = torch
    path = "models/house_price_model.pt"

Block-by-block breakdown

dataset — declares the data source and preprocessing:

Key        Value                      Notes
source     csv("data/housing.csv")    Path is relative to CWD at compile time
target     "price"                    Column name to predict
split      0.8                        80% train / 20% test
normalize  true                       Z-score normalisation via StandardScaler (fit on train, applied to test)
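
To make "fit on train, applied to test" concrete, here is a minimal pure-Python sketch of z-score scaling for one feature column. The values are illustrative. It uses the population standard deviation, which is what StandardScaler computes; the helper name zscore is an assumption for this example.

```python
from statistics import fmean, pstdev

# Illustrative values for a single feature column after the 80/20 split.
train_values = [1200.0, 1500.0, 1800.0]
test_values = [2100.0]

# Statistics come from the training split only.
mean = fmean(train_values)
std = pstdev(train_values)  # population std (ddof=0), as in StandardScaler


def zscore(values: list[float]) -> list[float]:
    """Scale values with the training-split mean and standard deviation."""
    return [(v - mean) / std for v in values]


train_scaled = zscore(train_values)
test_scaled = zscore(test_values)

print(train_scaled)
print(test_scaled)
```

The test row is scaled with the training statistics, so no information from the test split leaks into preprocessing.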

model — defines a nn.Sequential network:

  • input 10 — declares the feature dimension (must match data after encoding)
  • dense 64 relu → nn.Linear(10, 64) followed by nn.ReLU()
  • dense 32 relu → nn.Linear(64, 32) followed by nn.ReLU()
  • dense 1 linear → nn.Linear(32, 1) with no activation (raw regression output)
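
As a quick sanity check on the layer mapping above, the number of trainable parameters can be worked out by hand. This sketch assumes each dense layer is an nn.Linear with a bias term, so a layer with i inputs and o outputs holds i * o weights plus o biases.

```python
# Layer widths taken from the spec: input 10 -> 64 -> 32 -> 1.
layer_sizes = [10, 64, 32, 1]


def linear_params(n_in: int, n_out: int) -> int:
    """Parameters in one fully connected layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


per_layer = [
    linear_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total = sum(per_layer)

print(per_layer)  # [704, 2080, 33]
print(total)      # 2817
```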

train — training loop parameters:

  • loss = mse → nn.MSELoss()
  • optimizer = adam(lr=0.001) → optim.Adam(..., lr=0.001)
  • device = auto → CUDA if available, else CPU
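
The mapping above corresponds to a conventional PyTorch training loop. The following is a hand-written sketch of such a loop on synthetic data, not the code KynML generates; the synthetic tensors and the helper full_loss are assumptions made for illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data (assumption): 64 rows, 10 features, linear target.
x = torch.randn(64, 10)
y = x.sum(dim=1, keepdim=True)
loader = DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

# device = auto: prefer CUDA when it is available, otherwise use the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),
).to(device)
criterion = nn.MSELoss()                                     # loss = mse
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)   # optimizer = adam(lr=0.001)


def full_loss() -> float:
    """MSE over the whole synthetic dataset, without gradient tracking."""
    model.eval()
    with torch.no_grad():
        return criterion(model(x.to(device)), y.to(device)).item()


initial_loss = full_loss()
for epoch in range(20):                                      # epochs = 20
    model.train()
    for xb, yb in loader:                                    # batch = 32
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
final_loss = full_loss()

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```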

evaluate — computed after training on the held-out test split. Supported: mae, mse, rmse, accuracy.
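
For reference, the two metrics used in this tutorial can be written in a few lines of plain Python. This sketch uses the standard definitions of MAE and RMSE; it is not KynML's implementation, and the prices and predictions are made up for illustration.

```python
import math


def mae(y_true: list[float], y_pred: list[float]) -> float:
    """Mean absolute error: average size of the errors, in target units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    """Root mean squared error: like MAE, but penalises large errors more."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Illustrative prices and predictions (hypothetical values).
actual = [240000.0, 285000.0, 310000.0]
predicted = [250000.0, 280000.0, 330000.0]

mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)

print(f"mae: {mae_value:.4f}")
print(f"rmse: {rmse_value:.4f}")
```

Both metrics are in the same units as the target (here, price), and RMSE is never smaller than MAE on the same predictions.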

export: format = torch calls torch.save(model.state_dict(), path).


3. Validate and compile

# Check syntax and semantics without generating code
.venv/bin/python -m kynml.cli validate house_price.kyn

# Inspect the parsed AST
.venv/bin/python -m kynml.cli ast house_price.kyn

# Emit the PyTorch script without running it
.venv/bin/python -m kynml.cli compile house_price.kyn --out generated/house_price.py

4. Train

.venv/bin/python -m kynml.cli train house_price.kyn

Output:

Epoch 1/20 - loss: 12345678.2341
Epoch 2/20 - loss: 9823456.7812
...
Epoch 20/20 - loss: 1234567.3401
mae: 8432.1234
rmse: 11203.5678
Saved model to /path/to/models/house_price_model.pt

5. What the generated PyTorch looks like

kynml compile emits a complete, standalone Python file. The key sections for regression:

# Dataset loading (regression path — targets as float32, shape [-1, 1])
def load_dataset() -> tuple[DataLoader, DataLoader]:
    df = pd.read_csv(DATASET_PATH)
    features = df.drop(columns=[TARGET_COLUMN])
    target = df[TARGET_COLUMN]
    numeric_features = pd.get_dummies(features, drop_first=False)
    x = numeric_features.astype("float32").to_numpy()
    y = target.astype("float32").to_numpy()
    if y.ndim == 1:
        y = y.reshape(-1, 1)
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, train_size=0.8, shuffle=True, random_state=42,
    )
    if NORMALIZE:
        scaler = StandardScaler()
        x_train = scaler.fit_transform(x_train)
        x_test = scaler.transform(x_test)
    ...

# Model class — name taken from the model block identifier
class HousePriceModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(10, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Export
def export_model(model: nn.Module) -> None:
    if EXPORT_FORMAT == "torch":
        torch.save(model.state_dict(), path)
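
Because the export is a state dict rather than a pickled model, loading it requires the model class to be defined first. The sketch below round-trips a state dict through a temporary file with plain PyTorch. The temporary path and the freshly initialised "trained" model are stand-ins for illustration; in practice you would load models/house_price_model.pt.

```python
import tempfile
from pathlib import Path

import torch
from torch import nn


class HousePriceModel(nn.Module):
    """Same architecture as the generated class, repeated so this runs standalone."""

    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(10, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


torch.manual_seed(0)
trained = HousePriceModel()  # stand-in for a model that has been trained
trained.eval()

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "house_price_model.pt"  # stand-in for the exported file
    torch.save(trained.state_dict(), path)

    # Rebuild the architecture, then load the saved weights into it.
    restored = HousePriceModel()
    restored.load_state_dict(torch.load(path, map_location="cpu"))

restored.eval()
sample = torch.randn(2, 10)
with torch.no_grad():
    outputs_match = torch.allclose(trained(sample), restored(sample))

print("outputs match:", outputs_match)
```

A state dict holds only the network's parameters. If the spec sets normalize = true, inputs at inference time need the same scaling that was applied during training; how the scaler is persisted is not covered in this tutorial.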

6. Next steps

  • Add regularisation: dropout, batchnorm
  • Speed up training: see the Speed Guide for precision = fp16, compile = true, num_workers
  • Serve predictions: see the kynml.serving module for generate_service
  • Switch to AdamW for weight decay: optimizer = adamw(lr=0.001, weight_decay=0.01)

Troubleshooting

Target column 'price' not found — confirm the column name exactly matches target in your spec (case-sensitive).

input size mismatch — the input N value must equal the number of columns after pd.get_dummies encoding. Print numeric_features.shape[1] in the generated script to check.

CUDA out of memory — reduce batch size or switch device = cpu.