Datasets and Connectors
How KynML loads training data — CSV files, HuggingFace Hub, Amazon S3, Cloudflare R2, and local Parquet.
Overview
Every dataset block declares a source that tells KynML where to read data from. The source is a function-call expression; the built-in connector is csv(). The kynml.integrations package adds huggingface and objectstore connectors for use in your own runtime scripts or extended training pipelines.
dataset <Name>:
source = <connector-call>
target = "<column-name>"
split = 0.8 # train fraction, default 0.8
normalize = false # StandardScaler z-score, default false
shuffle = true # shuffle before split and in DataLoader, default true
num_workers = 0 # DataLoader workers, default 0
pin_memory = false # DataLoader pin_memory, default false
prefetch = <N> # DataLoader prefetch_factor (optional)
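The split and normalize options can be illustrated with a small, dependency-free sketch of the arithmetic: cut the rows at the train fraction, compute per-column mean and standard deviation on the training rows only, then apply those statistics to both partitions. This is not KynML's generated code. The function names, the seed parameter, and the truncating cut are assumptions made for illustration; the generated script is documented as using StandardScaler, which this stand-in imitates by using the population standard deviation and leaving constant columns unscaled.

```python
import random
import statistics


def split_rows(rows, split=0.8, shuffle=True, seed=0):
    """Return (train, test) with roughly `split` of the rows in train."""
    if not 0.0 < split < 1.0:
        raise ValueError("split must be strictly between 0 and 1")
    rows = list(rows)
    if shuffle:
        # `seed` is a hypothetical parameter, used here for repeatability.
        random.Random(seed).shuffle(rows)
    cut = int(len(rows) * split)  # assumption: the cut index is truncated
    return rows[:cut], rows[cut:]


def fit_zscore(train_rows):
    """Per-column mean and population std, computed on training rows only."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has std 0; fall back to 1.0 so it is left unscaled.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_zscore(rows, means, stds):
    """Scale rows with statistics that were fitted elsewhere."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
```

Fitting on the training rows and reusing those statistics for the test rows keeps information about the held-out data out of the scaler.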
CSV connector
csv() is the only connector the compiler supports natively. The generated PyTorch script calls pd.read_csv() directly.
dataset HouseData:
source = csv("data/housing.csv")
target = "price"
split = 0.8
normalize = true
Path resolution — the path string is resolved at codegen time relative to CWD. If the file does not exist there, the compiler checks relative to the .kyn file's directory. The resolved absolute path is embedded in the generated script.
Column handling — pd.get_dummies is called on all feature columns. Numeric columns pass through as float32. String/boolean columns become one-hot integer columns (drop_first=False). The target column is removed from features before encoding.
Target dtype — float32 for regression and BCE, int64 / torch.long for cross_entropy (multiclass). KynML infers this from loss.
No extras required — pandas, numpy, and scikit-learn are all hard dependencies installed with the base package.
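The path-resolution and column-handling notes above can be sketched in a few lines of pandas. This is an approximation written from the description, not the code KynML generates: resolve_csv_path and encode_frame are hypothetical names, and the loss argument is only compared against "cross_entropy", the one loss identifier this page spells out.

```python
from pathlib import Path

import pandas as pd


def resolve_csv_path(path: str, kyn_file: str) -> Path:
    """Look in the CWD first, then next to the .kyn file; return an absolute path."""
    for base in (Path.cwd(), Path(kyn_file).resolve().parent):
        candidate = base / path
        if candidate.exists():
            return candidate.resolve()
    raise FileNotFoundError(f"{path!r} not found relative to the CWD or {kyn_file!r}")


def encode_frame(df: pd.DataFrame, target: str, loss: str):
    """Split off the target, then one-hot encode the non-numeric feature columns."""
    # Assumption: every loss other than "cross_entropy" takes a float32 target.
    y_dtype = "int64" if loss == "cross_entropy" else "float32"
    y = df[target].astype(y_dtype)

    features = df.drop(columns=[target])  # the target never reaches the encoder
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    numeric = [c for c in features.columns if c not in categorical]
    features = features.astype({c: "float32" for c in numeric})
    if categorical:
        # drop_first=False keeps one indicator column per category value.
        features = pd.get_dummies(
            features, columns=categorical, drop_first=False, dtype="int64"
        )
    return features, y
```

Running encode_frame on a small sample of your CSV is a quick way to preview how many feature columns a high-cardinality string column will expand into.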
HuggingFace connector
Load any dataset from HuggingFace Hub as a pandas DataFrame.
Install
pip install 'kynml[hf]'
# installs: datasets>=2.0
Usage in Python
load_hf_dataset is in kynml.integrations.huggingface. Call it in your own data pipeline or in a post-compile script:
from kynml.integrations.huggingface import load_hf_dataset
df = load_hf_dataset(
    "mstz/heart_failure",   # HuggingFace dataset ID
    split="train",          # split to load ("train", "test", "validation", ...)
    target="DEATH_EVENT",   # optional; validates that the column exists before returning
)
print(df.shape) # (299, 13)
print(df.dtypes)
Once you have the DataFrame, save it to CSV and point your .kyn spec at it:
df.to_csv("data/heart_failure.csv", index=False)
dataset HeartData:
source = csv("data/heart_failure.csv")
target = "DEATH_EVENT"
split = 0.8
normalize = true
API
load_hf_dataset(
    dataset_id: str,            # e.g. "mstz/heart_failure", "stanfordnlp/imdb"
    split: str = "train",       # HuggingFace split name
    target: str | None = None,  # if set, raises ValueError if column absent
) -> pandas.DataFrame
Raises ImportError if datasets is not installed. Raises ValueError if target is specified but not found in the loaded dataset.
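To make the contract above concrete, here is a hedged sketch of a loader with the same documented behaviour, built on the public datasets API (load_dataset and Dataset.to_pandas). It is not the KynML source: load_hf_dataframe and check_target are hypothetical names, and the error messages are invented for illustration.

```python
from typing import Iterable, Optional


def check_target(columns: Iterable[str], target: Optional[str]) -> None:
    """Raise ValueError when a requested target column is missing."""
    columns = list(columns)
    # The comparison is exact, so "death_event" does not match "DEATH_EVENT".
    if target is not None and target not in columns:
        raise ValueError(f"target {target!r} not found; available columns: {columns}")


def load_hf_dataframe(dataset_id: str, split: str = "train", target: Optional[str] = None):
    """Download one split from the Hub and return it as a pandas DataFrame."""
    try:
        from datasets import load_dataset  # imported lazily: optional dependency
    except ImportError as exc:
        raise ImportError("datasets is missing; try: pip install 'kynml[hf]'") from exc
    frame = load_dataset(dataset_id, split=split).to_pandas()
    check_target(frame.columns, target)
    return frame
```

Validating the target before returning means a misspelled column name fails right after the download rather than later, during compilation or training.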
Example: MNIST (tabular form)
from kynml.integrations.huggingface import load_hf_dataset
df = load_hf_dataset("ylecun/mnist", split="train")
# Flatten each image into per-pixel numeric columns here (step omitted);
# a raw image column does not produce a usable tabular CSV.
df.to_csv("data/mnist_train.csv", index=False)
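The flattening step mentioned in the comment could look roughly like the sketch below. It assumes the images have already been decoded into equally sized arrays; how the image column arrives from load_hf_dataset is not specified here, so decoding is left out. The function name and the pixel_<i> column names are made up for illustration.

```python
import numpy as np
import pandas as pd


def images_to_pixel_frame(images, labels) -> pd.DataFrame:
    """Turn N images of identical shape into N rows of per-pixel columns."""
    arr = np.asarray(images, dtype=np.uint8)
    flat = arr.reshape(len(arr), -1)  # (N, H, W) becomes (N, H*W)
    columns = [f"pixel_{i}" for i in range(flat.shape[1])]
    frame = pd.DataFrame(flat, columns=columns)
    frame["label"] = np.asarray(labels)  # target column for the .kyn spec
    return frame
```

For 28x28 MNIST digits this gives 784 pixel columns plus label, and the resulting frame can be written with to_csv and referenced from csv() with target = "label".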
Object-store connector (S3, R2, local Parquet)
Load .parquet files from Amazon S3, Cloudflare R2, or the local filesystem.
Install
pip install 'kynml[objectstore]'
# installs: s3fs>=2023.0, pyarrow>=14.0
API
from kynml.integrations.objectstore import load_remote
df = load_remote(uri)
load_remote dispatches on the URI scheme:
| URI pattern | Backend | Required env vars |
|---|---|---|
| s3://bucket/key.parquet | Amazon S3 via s3fs | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION |
| r2://bucket/key.parquet | Cloudflare R2 via s3fs + custom endpoint | All S3 vars + R2_ENDPOINT_URL |
| /local/path/data.parquet | Local Parquet via pyarrow | none |
Only .parquet files are supported. CSV paths must use the csv() connector in the spec.
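The dispatch rules in the table can be expressed as a small pre-flight check that runs before any network call. This is a sketch that mirrors the table, not load_remote's implementation; classify_uri, missing_env, and REQUIRED_ENV are hypothetical names.

```python
import os
from typing import List, Mapping, Optional
from urllib.parse import urlparse

S3_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")
REQUIRED_ENV = {
    "s3": S3_VARS,
    "r2": S3_VARS + ("R2_ENDPOINT_URL",),
    "local": (),
}


def classify_uri(uri: str) -> str:
    """Return "s3", "r2" or "local" for a Parquet URI, or raise ValueError."""
    scheme = urlparse(uri).scheme
    if scheme in ("s3", "r2"):
        backend = scheme
    elif scheme == "":
        backend = "local"  # plain absolute or relative filesystem path
    else:
        # Note: a Windows drive letter also parses as a scheme; not handled here.
        raise ValueError(f"unsupported URI scheme: {scheme!r}")
    if not uri.endswith(".parquet"):
        raise ValueError("only .parquet files are supported")
    return backend


def missing_env(backend: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """List the required variables that are unset or empty for a backend."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV[backend] if not env.get(name)]
```

Calling missing_env(classify_uri(uri)) before the load turns a late credentials failure into an immediate, readable list of the variables that still need to be set.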
Amazon S3
import os
os.environ["AWS_ACCESS_KEY_ID"] = "AKIA..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
from kynml.integrations.objectstore import load_remote
df = load_remote("s3://my-ml-bucket/datasets/churn_2024.parquet")
df.to_csv("data/churn.csv", index=False)
Cloudflare R2
import os
os.environ["AWS_ACCESS_KEY_ID"] = "<r2-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<r2-secret>"
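os.environ["AWS_DEFAULT_REGION"] = "auto"  # listed as required above; R2 uses the region name "auto"
os.environ["R2_ENDPOINT_URL"] = "https://<account_id>.r2.cloudflarestorage.com"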
from kynml.integrations.objectstore import load_remote
df = load_remote("r2://my-bucket/training/features.parquet")
df.to_csv("data/features.csv", index=False)
R2 goes through the same s3fs code path as S3, with R2_ENDPOINT_URL passed as the endpoint override. The r2:// scheme is rewritten to s3:// before dispatch.
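If you need to read from R2 without load_remote, the scheme rewrite and endpoint override described above can be approximated with pandas and s3fs directly. The helper names below are hypothetical, and this is one plausible way to do it rather than a description of KynML internals. The storage_options keys (key, secret, client_kwargs) are standard s3fs options that pandas forwards to the filesystem.

```python
import os
from typing import Mapping, Optional, Tuple


def r2_to_s3(uri: str, env: Mapping[str, str]) -> Tuple[str, dict]:
    """Rewrite r2://bucket/key to s3://bucket/key and build the s3fs options."""
    prefix = "r2://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an r2:// URI: {uri!r}")
    s3_uri = "s3://" + uri[len(prefix):]
    storage_options = {
        "key": env["AWS_ACCESS_KEY_ID"],
        "secret": env["AWS_SECRET_ACCESS_KEY"],
        # The endpoint override is what points the S3 client at Cloudflare.
        "client_kwargs": {"endpoint_url": env["R2_ENDPOINT_URL"]},
    }
    return s3_uri, storage_options


def read_r2_parquet(uri: str, env: Optional[Mapping[str, str]] = None):
    """Read an R2-hosted Parquet file into a DataFrame (needs pandas, s3fs, pyarrow)."""
    import pandas as pd  # imported lazily so r2_to_s3 stays dependency-free

    s3_uri, options = r2_to_s3(uri, os.environ if env is None else env)
    return pd.read_parquet(s3_uri, storage_options=options)
```

Keeping the rewrite separate from the read makes it easy to log or inspect the final URI and endpoint when debugging credentials.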
Local Parquet
from kynml.integrations.objectstore import load_remote
df = load_remote("/absolute/path/to/train.parquet")
# Relative paths work too:
df = load_remote("data/train.parquet")
Requires only pyarrow — no credentials needed.
End-to-end pipeline example
import subprocess
import sys
from kynml.integrations.objectstore import load_remote
# 1. Pull from R2
df = load_remote("r2://prod-bucket/datasets/transactions.parquet")
# 2. Save as CSV for KynML
df.to_csv("data/transactions.csv", index=False)
# 3. Compile and train, using the same interpreter that runs this script
subprocess.run([sys.executable, "-m", "kynml.cli", "train", "fraud_detector.kyn"], check=True)
Dataset block reference
All fields and their defaults:
| Field | Type | Default | Notes |
|---|---|---|---|
| source | csv("path") | required | Only csv() in the spec; integrations used in Python pre-processing |
| target | string | required | Target column name, case-sensitive |
| split | float | 0.8 | Train fraction; must be (0, 1) exclusive |
| normalize | bool | false | StandardScaler fit on train, transform on test |
| shuffle | bool | true | Shuffle before split and in DataLoader |
| num_workers | int | 0 | DataLoader worker processes; see Speed Guide |
| pin_memory | bool | false | Pin host memory for GPU transfer; see Speed Guide |
| prefetch | int | none | prefetch_factor on DataLoader; requires num_workers >= 1 |
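The loader-related fields map naturally onto keyword arguments of torch.utils.data.DataLoader. The sketch below shows one way that mapping and the num_workers constraint on prefetch might be expressed; dataloader_kwargs is a hypothetical helper, and whether KynML applies shuffle to every loader it builds is not stated here. PyTorch's DataLoader itself rejects prefetch_factor when num_workers is 0, so checking early gives a clearer error.

```python
from typing import Any, Dict, Optional


def dataloader_kwargs(
    num_workers: int = 0,
    pin_memory: bool = False,
    prefetch: Optional[int] = None,
    shuffle: bool = True,
) -> Dict[str, Any]:
    """Translate dataset-block fields into DataLoader keyword arguments."""
    if num_workers < 0:
        raise ValueError("num_workers must be >= 0")
    if prefetch is not None and num_workers < 1:
        raise ValueError("prefetch requires num_workers >= 1")
    kwargs: Dict[str, Any] = {
        "shuffle": shuffle,
        "num_workers": num_workers,
        "pin_memory": pin_memory,
    }
    if prefetch is not None:
        # Only pass prefetch_factor when it was set, so PyTorch's default applies otherwise.
        kwargs["prefetch_factor"] = prefetch
    return kwargs
```

With PyTorch installed, the result can be splatted into the constructor, for example DataLoader(train_ds, batch_size=32, **dataloader_kwargs(num_workers=2, prefetch=4)), where train_ds and the batch size are placeholders.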
Installing all extras at once
pip install 'kynml[all]'
# includes: serving, mcp, hf, objectstore