Datasets and Connectors

How KynML loads training data — CSV files, HuggingFace Hub, Amazon S3, Cloudflare R2, and local Parquet.

Overview

Every dataset block declares a source that tells KynML where to read data from. The source is a function-call expression, and csv() is the only connector built into the compiler. The kynml.integrations package adds huggingface and objectstore loaders. These are Python functions, not spec-level connectors: call them from your own runtime scripts or extended training pipelines, write the result to CSV, and reference that file with csv().

dataset <Name>:
    source = <connector-call>
    target = "<column-name>"
    split = 0.8         # train fraction, default 0.8
    normalize = false   # StandardScaler z-score, default false
    shuffle = true      # shuffle before split and in DataLoader, default true
    num_workers = 0     # DataLoader workers, default 0
    pin_memory = false  # DataLoader pin_memory, default false
    prefetch = <N>      # DataLoader prefetch_factor (optional)
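
To make the split and normalize fields concrete, here is a minimal plain-Python sketch of the behaviour they describe: cut the rows at the train fraction, compute z-score statistics on the training rows only, then apply those statistics to both parts. The helper names are hypothetical, the sketch skips shuffling, and it stands in for scikit-learn's StandardScaler rather than reproducing KynML's generated code.

```python
import statistics

def train_test_split_rows(rows, split=0.8):
    """Split rows into (train, test) by train fraction; hypothetical helper."""
    if not 0.0 < split < 1.0:
        raise ValueError("split must be strictly between 0 and 1")
    cut = int(len(rows) * split)
    return rows[:cut], rows[cut:]

def fit_zscore(train_rows):
    """Return per-column means and standard deviations from the training rows."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # Population std (ddof=0), as StandardScaler uses; constant columns scale by 1.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_zscore(rows, means, stds):
    """Scale every row with statistics that were fitted elsewhere."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
train, test = train_test_split_rows(rows, split=0.8)
means, stds = fit_zscore(train)          # fitted on train only
train_scaled = apply_zscore(train, means, stds)
test_scaled = apply_zscore(test, means, stds)
```

Fitting on the training rows alone keeps test-set statistics from leaking into training, which is why the test rows are transformed but never used to fit.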

CSV connector

csv() is the only connector the compiler supports natively. The generated PyTorch script calls pd.read_csv() directly.

dataset HouseData:
    source = csv("data/housing.csv")
    target = "price"
    split = 0.8
    normalize = true

Path resolution — the path string is resolved at codegen time relative to CWD. If the file does not exist there, the compiler checks relative to the .kyn file's directory. The resolved absolute path is embedded in the generated script.

Column handling — pd.get_dummies is called on all feature columns. Numeric columns pass through as float32. String/boolean columns become one-hot integer columns (drop_first=False). The target column is removed from features before encoding.

Target dtype — float32 for regression and BCE, int64 / torch.long for cross_entropy (multiclass). KynML infers this from loss.
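
A rough sketch of the column handling and target dtype rules, using pandas directly. The function name prepare_xy and the loss name "mse" are assumptions for illustration; only cross_entropy is named in this page. The dtype="int64" argument is passed explicitly because recent pandas versions return boolean dummy columns by default.

```python
import pandas as pd

def prepare_xy(df, target, loss):
    """Encode features and pick the target dtype; hypothetical helper."""
    # Remove the target before encoding so it never becomes a feature.
    features = df.drop(columns=[target])
    # String columns become one-hot columns; numeric columns pass through.
    encoded = pd.get_dummies(features, drop_first=False, dtype="int64")
    X = encoded.to_numpy(dtype="float32")
    # Class indices for cross_entropy, floats for regression and BCE.
    y_dtype = "int64" if loss == "cross_entropy" else "float32"
    y = df[target].to_numpy(dtype=y_dtype)
    return encoded.columns.tolist(), X, y

df = pd.DataFrame({
    "sqft": [800, 1200, 950],
    "city": ["oslo", "bergen", "oslo"],
    "price": [1.0, 2.0, 1.5],
})
columns, X, y = prepare_xy(df, target="price", loss="mse")
print(columns)
print(X.dtype, y.dtype)
```

With drop_first=False every category keeps its own column, so a two-valued string column adds two features rather than one.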

No extras required — pandas, numpy, and scikit-learn are all hard dependencies installed with the base package.

HuggingFace connector

Load any dataset from HuggingFace Hub as a pandas DataFrame.

Install

pip install 'kynml[hf]'
# installs: datasets>=2.0

Usage in Python

load_hf_dataset lives in kynml.integrations.huggingface. Call it from your own data pipeline or from a pre-processing script that runs before you compile the spec:

from kynml.integrations.huggingface import load_hf_dataset

df = load_hf_dataset(
    "mstz/heart_failure",   # HuggingFace dataset ID
    split="train",          # split to load ("train", "test", "validation", ...)
    target="DEATH_EVENT",   # optional — validates the column exists before returning
)

print(df.shape)            # (299, 13)
print(df.dtypes)

Once you have the DataFrame, save it to CSV and point your .kyn spec at it:

df.to_csv("data/heart_failure.csv", index=False)

dataset HeartData:
    source = csv("data/heart_failure.csv")
    target = "DEATH_EVENT"
    split = 0.8
    normalize = true

API

load_hf_dataset(
    dataset_id: str,        # e.g. "mstz/heart_failure", "stanfordnlp/imdb"
    split: str = "train",   # HuggingFace split name
    target: str | None = None,  # if set, raises ValueError if column absent
) -> pandas.DataFrame

Raises ImportError if datasets is not installed. Raises ValueError if target is specified but not found in the loaded dataset.
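
The error contract above can be pictured with the following stand-in. It is a sketch, not KynML's implementation: load_hf_dataset_sketch and check_target are hypothetical names, and the only library calls assumed are datasets.load_dataset and Dataset.to_pandas from the HuggingFace datasets package.

```python
def check_target(columns, target):
    """Raise ValueError if a target is requested but missing from columns."""
    if target is not None and target not in columns:
        raise ValueError(
            f"target column {target!r} not found; available: {sorted(columns)}"
        )

def load_hf_dataset_sketch(dataset_id, split="train", target=None):
    """Illustrative stand-in for load_hf_dataset (assumed behaviour)."""
    try:
        # Imported lazily so the base package works without the extra.
        from datasets import load_dataset
    except ImportError as exc:
        raise ImportError("datasets is missing; pip install 'kynml[hf]'") from exc
    df = load_dataset(dataset_id, split=split).to_pandas()
    check_target(list(df.columns), target)
    return df

# Validation works on any list of column names:
check_target(["age", "DEATH_EVENT"], "DEATH_EVENT")  # passes silently
```

Validating the target before returning surfaces a misspelled column name at load time, rather than later when the compiled script reads the CSV.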

Example: MNIST (tabular form)

from kynml.integrations.huggingface import load_hf_dataset

df = load_hf_dataset("ylecun/mnist", split="train")
# The image column holds image data, not numeric features. Flatten each image
# into per-pixel columns before saving, or start from a pixel-level flat CSV.
df.to_csv("data/mnist_train.csv", index=False)
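
One way to get a pixel-level table is sketched below with NumPy and pandas. It assumes each image is already available as a 28x28 array or an object that np.asarray can convert; how load_hf_dataset represents the image column is not specified here, so decoding is left out. The helper name and the px_ column prefix are hypothetical.

```python
import numpy as np
import pandas as pd

def flatten_images(images, labels):
    """Turn equally sized 2-D images into one row of pixel columns each."""
    pixels = np.stack(
        [np.asarray(img, dtype=np.uint8).reshape(-1) for img in images]
    )
    columns = [f"px_{i}" for i in range(pixels.shape[1])]
    flat = pd.DataFrame(pixels, columns=columns)
    flat["label"] = list(labels)  # target column for the spec
    return flat

# Synthetic stand-ins for two 28x28 grayscale images.
images = [
    np.zeros((28, 28), dtype=np.uint8),
    np.full((28, 28), 255, dtype=np.uint8),
]
flat = flatten_images(images, labels=[0, 1])
print(flat.shape)
```

The resulting frame can be written with to_csv and referenced from a dataset block with target = "label".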

Object-store connector (S3, R2, local Parquet)

Load .parquet files from Amazon S3, Cloudflare R2, or the local filesystem.

Install

pip install 'kynml[objectstore]'
# installs: s3fs>=2023.0, pyarrow>=14.0

API

from kynml.integrations.objectstore import load_remote

df = load_remote(uri)

load_remote dispatches on the URI scheme:

URI pattern	Backend	Required env vars
`s3://bucket/key.parquet`	Amazon S3 via `s3fs`	`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`
`r2://bucket/key.parquet`	Cloudflare R2 via `s3fs` + custom endpoint	All S3 vars + `R2_ENDPOINT_URL`
`/local/path/data.parquet`	Local Parquet via `pyarrow`	—

Only .parquet files are supported. CSV paths must use the csv() connector in the spec.
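
The dispatch rules in the table can be sketched as a pure function that decides what to do without touching the network. Everything here is an assumption for illustration: plan_load, the returned dictionary, and the error types are hypothetical, and anything that is not s3:// or r2:// is treated as a local path.

```python
from urllib.parse import urlparse

S3_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")

def plan_load(uri, env):
    """Describe how a URI would be loaded; env is a mapping such as os.environ."""
    if not uri.endswith(".parquet"):
        raise ValueError(f"only .parquet files are supported, got {uri!r}")
    scheme = urlparse(uri).scheme
    if scheme == "s3":
        required, endpoint, target = S3_VARS, None, uri
    elif scheme == "r2":
        required = S3_VARS + ("R2_ENDPOINT_URL",)
        endpoint = env.get("R2_ENDPOINT_URL")
        # R2 speaks the S3 protocol, so the scheme is rewritten.
        target = "s3://" + uri[len("r2://"):]
    else:
        # No recognised remote scheme: read from the local filesystem.
        return {"backend": "pyarrow", "uri": uri, "endpoint_url": None}
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise EnvironmentError(f"missing environment variables: {', '.join(missing)}")
    return {"backend": "s3fs", "uri": target, "endpoint_url": endpoint}

print(plan_load("data/train.parquet", {}))
```

Checking the environment variables up front gives a clear message naming what is missing, instead of an authentication failure from the storage backend.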

Amazon S3

import os
os.environ["AWS_ACCESS_KEY_ID"] = "AKIA..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

from kynml.integrations.objectstore import load_remote

df = load_remote("s3://my-ml-bucket/datasets/churn_2024.parquet")
df.to_csv("data/churn.csv", index=False)

Cloudflare R2

import os
os.environ["AWS_ACCESS_KEY_ID"] = "<r2-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<r2-secret>"
os.environ["AWS_DEFAULT_REGION"] = "auto"  # R2 accepts "auto" as the region
os.environ["R2_ENDPOINT_URL"] = "https://<account_id>.r2.cloudflarestorage.com"

from kynml.integrations.objectstore import load_remote

df = load_remote("r2://my-bucket/training/features.parquet")
df.to_csv("data/features.csv", index=False)

R2 uses the same s3fs path internally but passes the R2_ENDPOINT_URL as the endpoint override. The r2:// scheme is converted to s3:// before dispatch.

Local Parquet

from kynml.integrations.objectstore import load_remote

df = load_remote("/absolute/path/to/train.parquet")
# or relative paths work too:
df = load_remote("data/train.parquet")

Requires only pyarrow — no credentials needed.

End-to-end pipeline example

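import os
import subprocess
import sys

from kynml.integrations.objectstore import load_remote

# 1. Pull from R2 (the R2 environment variables must already be set)
df = load_remote("r2://prod-bucket/datasets/transactions.parquet")

# 2. Save as CSV for KynML; create the output directory if it is missing
os.makedirs("data", exist_ok=True)
df.to_csv("data/transactions.csv", index=False)

# 3. Compile and train with the same interpreter that runs this script
subprocess.run([sys.executable, "-m", "kynml.cli", "train", "fraud_detector.kyn"], check=True)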

Dataset block reference

All fields and their defaults:

Field	Type	Default	Notes
`source`	`csv("path")`	required	Only `csv()` in the spec; integrations used in Python pre-processing
`target`	string	required	Target column name, case-sensitive
`split`	float	`0.8`	Train fraction; must be strictly between 0 and 1
`normalize`	bool	`false`	`StandardScaler` fit on train, transform on test
`shuffle`	bool	`true`	Shuffle before split and in DataLoader
`num_workers`	int	`0`	DataLoader worker processes; see Speed Guide
`pin_memory`	bool	`false`	Pin host memory for GPU transfer; see Speed Guide
`prefetch`	int	none	`prefetch_factor` on DataLoader; requires `num_workers >= 1`
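
As a rough illustration of how the loader-related fields relate to PyTorch, the sketch below maps them to torch.utils.data.DataLoader keyword arguments and enforces the prefetch constraint. The function name is hypothetical and the mapping is an assumption based on the notes in the table, not a copy of the generated script.

```python
def dataloader_kwargs(shuffle=True, num_workers=0, pin_memory=False, prefetch=None):
    """Build DataLoader keyword arguments from dataset-block fields."""
    kwargs = {
        "shuffle": shuffle,
        "num_workers": num_workers,
        "pin_memory": pin_memory,
    }
    if prefetch is not None:
        # PyTorch only accepts prefetch_factor when worker processes exist.
        if num_workers < 1:
            raise ValueError("prefetch requires num_workers >= 1")
        kwargs["prefetch_factor"] = prefetch
    return kwargs

print(dataloader_kwargs(num_workers=2, prefetch=4))
```

Leaving prefetch_factor out when prefetch is unset lets the DataLoader fall back to its own default instead of receiving an explicit value.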

Installing all extras at once

pip install 'kynml[all]'
# includes: serving, mcp, hf, objectstore