Datasets and Connectors

How KynML loads training data — CSV files, HuggingFace Hub, Amazon S3, Cloudflare R2, and local Parquet.


Overview

Every dataset block declares a source that tells KynML where to read data from. The source is a function-call expression, and csv() is the only connector built into the compiler. The kynml.integrations package adds huggingface and objectstore loaders that you call from Python, in your own runtime scripts or extended training pipelines, to fetch data and save it as CSV for the spec.

dataset <Name>:
    source = <connector-call>
    target = "<column-name>"
    split = 0.8         # train fraction, default 0.8
    normalize = false   # StandardScaler z-score, default false
    shuffle = true      # shuffle before split and in DataLoader, default true
    num_workers = 0     # DataLoader workers, default 0
    pin_memory = false  # DataLoader pin_memory, default false
    prefetch = <N>      # DataLoader prefetch_factor (optional)
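
The loader fields in the template correspond to torch.utils.data.DataLoader keyword arguments. The sketch below shows one way that mapping could look. The helper name dataloader_kwargs is hypothetical and not part of KynML; it only builds the keyword dictionary, so it runs without PyTorch installed.

```python
from typing import Any, Optional


def dataloader_kwargs(
    num_workers: int = 0,
    pin_memory: bool = False,
    prefetch: Optional[int] = None,
) -> dict[str, Any]:
    """Build DataLoader keyword arguments from dataset-block fields (hypothetical helper)."""
    if num_workers < 0:
        raise ValueError("num_workers must be >= 0")
    kwargs: dict[str, Any] = {
        "num_workers": num_workers,
        "pin_memory": pin_memory,
    }
    if prefetch is not None:
        # PyTorch rejects prefetch_factor when no worker processes are used.
        if num_workers < 1:
            raise ValueError("prefetch requires num_workers >= 1")
        kwargs["prefetch_factor"] = prefetch
    return kwargs


print(dataloader_kwargs())
print(dataloader_kwargs(num_workers=4, pin_memory=True, prefetch=2))
```

Leaving prefetch unset omits prefetch_factor entirely, which lets DataLoader apply its own default.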

CSV connector

csv() is the only connector supported natively by the compiler. The generated PyTorch script calls pd.read_csv() directly.

dataset HouseData:
    source = csv("data/housing.csv")
    target = "price"
    split = 0.8
    normalize = true

Path resolution: the path string is resolved at codegen time, first relative to the current working directory (CWD). If the file does not exist there, the compiler checks relative to the .kyn file's directory. The resolved absolute path is embedded in the generated script.
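
The lookup order can be sketched as follows. This is an illustration, not the compiler's code: the function name resolve_csv_path is hypothetical, and its handling of absolute paths and of files missing from both locations is an assumption.

```python
from pathlib import Path


def resolve_csv_path(raw: str, spec_file: Path, cwd: Path) -> Path:
    """Resolve a csv() path: try the working directory first, then the spec's directory."""
    candidate = Path(raw)
    if candidate.is_absolute():
        # Assumption: absolute paths are used as given.
        return candidate
    from_cwd = (cwd / candidate).resolve()
    if from_cwd.exists():
        return from_cwd
    # Fall back to the directory that contains the .kyn file.
    # Assumption: this path is returned even if the file is missing there too.
    return (spec_file.parent / candidate).resolve()


print(resolve_csv_path("data/housing.csv", Path("specs/model.kyn"), Path.cwd()))
```

Because the absolute path is baked into the generated script, moving the CSV after compiling means compiling again.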

Column handling: pd.get_dummies is called on all feature columns. Numeric columns pass through as float32. String/boolean columns become one-hot integer columns (drop_first=False). The target column is removed from features before encoding.
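
A small illustration of that encoding on a made-up three-row frame with one numeric and one string feature. The column names are invented, and the final cast to float32 is an assumption about how the feature tensor is built rather than a copy of the generated script.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "sqft": [850.0, 1200.0, 1500.0],
        "city": ["austin", "boise", "austin"],
        "price": [210000.0, 340000.0, 415000.0],
    }
)

target = "price"
features = df.drop(columns=[target])  # the target leaves the feature frame first
encoded = pd.get_dummies(features, drop_first=False)
encoded = encoded.astype("float32")  # assumption: one uniform dtype for the tensor

print(list(encoded.columns))
print(encoded)
```

Here city expands into city_austin and city_boise, while sqft is left as a single numeric column. With drop_first=False every category keeps its own column.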

Target dtype: float32 for regression and BCE; int64 (torch.long) for cross_entropy (multiclass). KynML infers this from the loss.
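
The rule amounts to a lookup keyed on the loss. The sketch below uses NumPy dtypes to stay free of a PyTorch dependency; the function name and the loss strings other than cross_entropy are hypothetical.

```python
import numpy as np

# Only cross_entropy is named in the text above; other loss names are placeholders.
INTEGER_TARGET_LOSSES = {"cross_entropy"}


def target_dtype(loss: str) -> np.dtype:
    """Return the NumPy dtype the target column would be cast to for a given loss."""
    if loss in INTEGER_TARGET_LOSSES:
        return np.dtype("int64")
    return np.dtype("float32")


labels = np.array([0, 2, 1])
print(labels.astype(target_dtype("cross_entropy")).dtype)  # int64
print(labels.astype(target_dtype("bce")).dtype)            # float32
```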

No extras required: pandas, numpy, and scikit-learn are hard dependencies installed with the base package.


HuggingFace connector

Load any dataset from HuggingFace Hub as a pandas DataFrame.

Install

pip install 'kynml[hf]'
# installs: datasets>=2.0

Usage in Python

load_hf_dataset is in kynml.integrations.huggingface. Call it in your own data pipeline or in a post-compile script:

from kynml.integrations.huggingface import load_hf_dataset

df = load_hf_dataset(
    "mstz/heart_failure",   # HuggingFace dataset ID
    split="train",          # split to load ("train", "test", "validation", ...)
    target="DEATH_EVENT",   # optional — validates the column exists before returning
)

print(df.shape)            # (299, 13)
print(df.dtypes)

Once you have the DataFrame, save it to CSV and point your .kyn spec at it:

df.to_csv("data/heart_failure.csv", index=False)

dataset HeartData:
    source = csv("data/heart_failure.csv")
    target = "DEATH_EVENT"
    split = 0.8
    normalize = true

API

load_hf_dataset(
    dataset_id: str,        # e.g. "mstz/heart_failure", "stanfordnlp/imdb"
    split: str = "train",   # HuggingFace split name
    target: str | None = None,  # if set, raises ValueError if column absent
) -> pandas.DataFrame

Raises ImportError if datasets is not installed. Raises ValueError if target is specified but not found in the loaded dataset.
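
To make the two error paths concrete, here is a stand-in that behaves the way the API section describes. It is an illustrative sketch, not the library's implementation: load_hf_sketch and check_target are hypothetical names, and only datasets.load_dataset and Dataset.to_pandas are real calls.

```python
from typing import Optional

import pandas as pd


def check_target(df: pd.DataFrame, target: Optional[str]) -> pd.DataFrame:
    """Raise ValueError when a requested target column is missing."""
    if target is not None and target not in df.columns:
        raise ValueError(
            f"target column {target!r} not found; available: {list(df.columns)}"
        )
    return df


def load_hf_sketch(
    dataset_id: str, split: str = "train", target: Optional[str] = None
) -> pd.DataFrame:
    """Illustrative stand-in for load_hf_dataset (not the real implementation)."""
    try:
        from datasets import load_dataset  # optional dependency, imported lazily
    except ImportError as exc:
        raise ImportError("install the extra with: pip install 'kynml[hf]'") from exc
    frame = load_dataset(dataset_id, split=split).to_pandas()
    return check_target(frame, target)


# The validation step alone, on a local frame (no network access needed):
sample = pd.DataFrame({"age": [60, 45], "DEATH_EVENT": [1, 0]})
print(check_target(sample, "DEATH_EVENT").shape)
```

Validating the target before returning means a typo in the column name fails at load time instead of later, inside the generated training script.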

Example: MNIST (tabular form)

from kynml.integrations.huggingface import load_hf_dataset

df = load_hf_dataset("ylecun/mnist", split="train")
# Flatten images to a feature column or use a pixel-level flat CSV
df.to_csv("data/mnist_train.csv", index=False)
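
The flattening step mentioned in the comment could look like the sketch below. It runs on two synthetic 28x28 arrays instead of the real download. The column names image and label, and the assumption that each image entry converts cleanly with np.asarray (as a PIL image does), are not confirmed by this page; if the loader returns encoded bytes, decode them first.

```python
import numpy as np
import pandas as pd

# Stand-in for the loaded frame: two fake 28x28 grayscale images plus labels.
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(28, 28), dtype=np.uint8) for _ in range(2)]
df = pd.DataFrame({"label": [5, 0]})
df["image"] = pd.Series(images, dtype=object)

# One row per image, one column per pixel (row-major order).
pixels = np.stack([np.asarray(img).reshape(-1) for img in df["image"]])
flat = pd.DataFrame(pixels, columns=[f"px_{i}" for i in range(pixels.shape[1])])
flat["label"] = df["label"].to_numpy()

print(flat.shape)  # (2, 785)
```

The resulting frame has 784 numeric pixel columns plus label, which is the shape the csv() connector can consume with target = "label".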

Object-store connector (S3, R2, local Parquet)

Load .parquet files from Amazon S3, Cloudflare R2, or the local filesystem.

Install

pip install 'kynml[objectstore]'
# installs: s3fs>=2023.0, pyarrow>=14.0

API

from kynml.integrations.objectstore import load_remote

df = load_remote(uri)

load_remote dispatches on the URI scheme:

| URI pattern | Backend | Required env vars |
| --- | --- | --- |
| s3://bucket/key.parquet | Amazon S3 via s3fs | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION |
| r2://bucket/key.parquet | Cloudflare R2 via s3fs + custom endpoint | All S3 vars + R2_ENDPOINT_URL |
| /local/path/data.parquet | Local Parquet via pyarrow | none |

Only .parquet files are supported. CSV paths must use the csv() connector in the spec.
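
The scheme dispatch can be pictured with a small classifier. This is a sketch of the behavior described above, not the code inside load_remote; classify_uri is a hypothetical name, and treating any scheme-less string as a local path is an assumption.

```python
from urllib.parse import urlparse


def classify_uri(uri: str) -> tuple[str, str]:
    """Return (backend, normalized_uri) for a load_remote-style URI (illustrative only)."""
    if not uri.endswith(".parquet"):
        raise ValueError("only .parquet files are supported")
    scheme = urlparse(uri).scheme
    if scheme == "s3":
        return "s3", uri
    if scheme == "r2":
        # R2 speaks the S3 API, so the URI is rewritten to s3:// for s3fs.
        return "r2", "s3://" + uri[len("r2://"):]
    if scheme == "":
        # Assumption: no scheme means a local filesystem path.
        return "local", uri
    raise ValueError(f"unsupported URI scheme: {scheme!r}")


print(classify_uri("s3://bucket/key.parquet"))
print(classify_uri("r2://bucket/key.parquet"))
print(classify_uri("/local/path/data.parquet"))
```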

Amazon S3

import os
os.environ["AWS_ACCESS_KEY_ID"] = "AKIA..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

from kynml.integrations.objectstore import load_remote

df = load_remote("s3://my-ml-bucket/datasets/churn_2024.parquet")
df.to_csv("data/churn.csv", index=False)
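
Outside of quick experiments, credentials usually come from the shell or a secrets manager rather than from assignments in the script. A preflight check along these lines can fail early with a readable message; missing_env is a hypothetical helper, and the variable lists are taken from the table above.

```python
import os
from typing import Mapping

S3_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")
REQUIRED_ENV = {
    "s3": S3_VARS,
    "r2": S3_VARS + ("R2_ENDPOINT_URL",),
}


def missing_env(backend: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """List the required variables that are unset or empty for a backend."""
    return [name for name in REQUIRED_ENV.get(backend, ()) if not env.get(name)]


missing = missing_env("s3")
if missing:
    print("Set these variables before calling load_remote:", ", ".join(missing))
else:
    print("S3 credentials found in the environment")
```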

Cloudflare R2

import os
os.environ["AWS_ACCESS_KEY_ID"] = "<r2-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<r2-secret>"
os.environ["R2_ENDPOINT_URL"] = "https://<account_id>.r2.cloudflarestorage.com"

from kynml.integrations.objectstore import load_remote

df = load_remote("r2://my-bucket/training/features.parquet")
df.to_csv("data/features.csv", index=False)

R2 goes through the same s3fs code path as S3 but passes R2_ENDPOINT_URL as the endpoint override. The r2:// scheme is rewritten to s3:// before dispatch.
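
For readers who want to reproduce the endpoint override without KynML, the sketch below builds the arguments that pandas and s3fs accept for an S3-compatible store. The helper r2_read_args is hypothetical and may differ from what load_remote does internally; the storage_options keys (key, secret, client_kwargs) are the ones s3fs understands.

```python
import os
from typing import Any, Mapping


def r2_read_args(
    uri: str, env: Mapping[str, str] = os.environ
) -> tuple[str, dict[str, Any]]:
    """Translate an r2:// URI into an s3:// URI plus fsspec storage options."""
    if not uri.startswith("r2://"):
        raise ValueError("expected an r2:// URI")
    s3_uri = "s3://" + uri[len("r2://"):]
    options = {
        "key": env["AWS_ACCESS_KEY_ID"],
        "secret": env["AWS_SECRET_ACCESS_KEY"],
        # The endpoint override is what points s3fs at Cloudflare instead of AWS.
        "client_kwargs": {"endpoint_url": env["R2_ENDPOINT_URL"]},
    }
    return s3_uri, options


# Usage (needs pandas, s3fs, pyarrow, and real credentials):
#   import pandas as pd
#   s3_uri, options = r2_read_args("r2://my-bucket/training/features.parquet")
#   df = pd.read_parquet(s3_uri, storage_options=options)
```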

Local Parquet

from kynml.integrations.objectstore import load_remote

df = load_remote("/absolute/path/to/train.parquet")
# relative paths work too:
df = load_remote("data/train.parquet")

Requires only pyarrow — no credentials needed.

End-to-end pipeline example

# 1. Pull from R2
from kynml.integrations.objectstore import load_remote
df = load_remote("r2://prod-bucket/datasets/transactions.parquet")

# 2. Save as CSV for KynML
df.to_csv("data/transactions.csv", index=False)

# 3. Compile and train
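import subprocess
import sys

# sys.executable runs the CLI with the same interpreter (and virtualenv) as this script
subprocess.run([sys.executable, "-m", "kynml.cli", "train", "fraud_detector.kyn"], check=True)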

Dataset block reference

All fields and their defaults:

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| source | csv("path") | required | Only csv() in the spec; integrations are used in Python pre-processing |
| target | string | required | Target column name, case-sensitive |
| split | float | 0.8 | Train fraction; must be in (0, 1), exclusive |
| normalize | bool | false | StandardScaler fit on train, transform on test |
| shuffle | bool | true | Shuffle before split and in DataLoader |
| num_workers | int | 0 | DataLoader worker processes; see Speed Guide |
| pin_memory | bool | false | Pin host memory for GPU transfer; see Speed Guide |
| prefetch | int | none | prefetch_factor on DataLoader; requires num_workers >= 1 |
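
The split and normalize semantics can be reproduced with scikit-learn directly. This is a sketch on random data, not the generated script: the fixed random_state is an assumption added for repeatability, and the array shapes are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.normal(size=100)

# split = 0.8 sends 80% of rows to training; shuffle = true shuffles before splitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, shuffle=True, random_state=0
)

# normalize = true: statistics come from the training rows only,
# then the same transform is applied to the test rows.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)  # (80, 3) (20, 3)
```

Fitting the scaler on the training rows alone keeps test-set statistics out of training, which is why the test features end up close to, but not exactly, zero mean and unit variance.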

Installing all extras at once

pip install 'kynml[all]'
# includes: serving, mcp, hf, objectstore