Datasets - Veri Documentation

Overview

Veri datasets can come from two places:

direct JSONL uploads stored in Veri-managed S3
connected external sources that are resolved into JSONL-like rows when a job starts

During GRPO training, the runner materializes rows into prompt records and passes the extra fields through to the reward path.
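
As a rough illustration of that split, the sketch below separates a row into its prompt and the remaining fields. The helper name `split_row` is hypothetical and this is not Veri's runner code; it only shows the shape of the data described above.

```python
def split_row(row: dict) -> tuple:
    """Separate a dataset row into (prompt, extra_fields). Hypothetical helper."""
    if "prompt" not in row:
        raise ValueError("row is missing the required 'prompt' field")
    prompt = row["prompt"]
    # Everything other than the prompt is treated as an extra field.
    extras = {key: value for key, value in row.items() if key != "prompt"}
    return prompt, extras


prompt, extras = split_row({"prompt": "What is 15 * 23?", "label": "345"})
print(prompt)  # What is 15 * 23?
print(extras)  # {'label': '345'}
```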

Supported Sources

upload - JSONL file uploaded to Veri
s3 - s3://bucket/key.jsonl
gs - gs://bucket/key.jsonl
az - az://container/blob.jsonl
hf - Hugging Face dataset ID plus optional config
postgres
mysql
snowflake
bigquery

Upload Format

Each line must be a valid JSON object with at least a prompt field:

{"prompt": "What is 15 * 23?"}
{"prompt": "Solve for x: 2x + 5 = 17"}
{"prompt": "Explain the chain rule in calculus."}

You can also store structured prompts and labels:

{"prompt": [{"role": "user", "content": "What is 15 * 23?"}], "label": "345"}
{"prompt": "Solve for x: 2x + 5 = 17", "expected_answer": "6"}

Extra fields such as label, expected_answer, or metadata stay attached to the row so reward functions can use them.
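
The following is a minimal sketch of how a reward function might consume those extra fields. The name `exact_match_reward` and its signature (completion text plus the row's extra fields as keyword arguments) are assumptions made for illustration; the interface Veri actually calls may differ.

```python
def exact_match_reward(completion: str, **fields) -> float:
    """Return 1.0 if the completion matches the row's reference answer, else 0.0.

    Hypothetical signature: assumes the row's extra fields arrive as keyword
    arguments.
    """
    # Accept either of the field names used in the examples above.
    expected = fields.get("label", fields.get("expected_answer"))
    if expected is None:
        return 0.0  # no reference answer stored on this row
    return 1.0 if completion.strip() == str(expected).strip() else 0.0


print(exact_match_reward(" 345 ", label="345"))      # 1.0
print(exact_match_reward("7", expected_answer="6"))  # 0.0
```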

Requirements

File format: JSONL (one JSON object per line)
Required field: prompt
Prompt type: string or chat-style message array
Encoding: UTF-8
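
These requirements can be checked locally before uploading. The sketch below uses only the Python standard library; the function names are hypothetical, and the rule that a chat-style prompt is a non-empty list of objects with `role` and `content` keys is an assumption inferred from the example above.

```python
import json


def validate_jsonl_line(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(row, dict):
        return ["line is not a JSON object"]
    if "prompt" not in row:
        return ["missing required field: prompt"]
    prompt = row["prompt"]
    if isinstance(prompt, str):
        return []
    # Assumption: chat-style prompts are non-empty lists of role/content objects.
    if isinstance(prompt, list) and prompt and all(
        isinstance(m, dict) and "role" in m and "content" in m for m in prompt
    ):
        return []
    return ["prompt must be a string or a list of chat messages"]


def validate_jsonl_file(path: str) -> dict:
    """Map 1-based line numbers to problems. Raises UnicodeDecodeError if not UTF-8."""
    problems = {}
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            issues = validate_jsonl_line(line)
            if issues:
                problems[number] = issues
    return problems
```

Calling `validate_jsonl_file("prompts.jsonl")` returns an empty dict when every line passes these checks.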

Preparing Your Dataset

Collect prompts

Gather prompts that represent the task you want the model to learn. Quality and diversity of prompts matter more than quantity.

Format as JSONL

Convert your prompts to JSONL. Each line must be a valid JSON object:

import json

prompts = [
    "What is 15 * 23?",
    "Solve for x: 2x + 5 = 17",
    "What is the derivative of x^3 + 2x?",
]

# Write one JSON object per line, encoded as UTF-8 as the requirements specify.
with open("prompts.jsonl", "w", encoding="utf-8") as f:
    for prompt in prompts:
        f.write(json.dumps({"prompt": prompt}) + "\n")

Upload to Veri

import veri

client = veri.Client(api_key="vk_your_api_key")
dataset = client.datasets.upload("prompts.jsonl")

Connect an external source

dataset = client.datasets.connect(
    name="gsm8k-train",
    source_type="hf",
    huggingface_dataset="gsm8k",
    huggingface_config={
        "split": "train",
        "column_mapping": {"question": "prompt", "answer": "label"},
    },
)
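
To make the effect of `column_mapping` concrete, the sketch below applies the same kind of mapping to a row locally. It assumes the mapping keys are source column names and the values are target field names, as the example above suggests. `apply_column_mapping` is a hypothetical helper, not part of the Veri client, and keeping unmapped columns under their original names is also an assumption.

```python
def apply_column_mapping(row: dict, mapping: dict) -> dict:
    """Rename keys in row according to mapping (source name -> target name).

    Assumption: columns not listed in the mapping keep their original name.
    """
    return {mapping.get(key, key): value for key, value in row.items()}


source_row = {"question": "What is 15 * 23?", "answer": "345"}
mapped = apply_column_mapping(source_row, {"question": "prompt", "answer": "label"})
print(mapped)  # {'prompt': 'What is 15 * 23?', 'label': '345'}
```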

You can validate the connection before creating it:

preview = client.datasets.validate(
    source_type="s3",
    source_uri="s3://my-bucket/prompts.jsonl",
)
print(preview)

Tips

Diverse prompts: Include a range of difficulty levels and problem types to train a more robust model.
Enough data: A few hundred prompts can work, but 1,000 or more generally gives more stable training.
Clean data: Remove duplicates, empty prompts, and malformed entries before uploading.
Map columns explicitly: For external sources, use Hugging Face column mapping or SQL aliases to normalize rows into the fields your reward function expects.
Match your task: Prompts should closely reflect the types of queries you expect at inference time.
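
The "Clean data" tip can be sketched as a small standard-library filter. The function name `clean_jsonl_lines` is hypothetical, and treating two rows as duplicates whenever their prompts are identical (even if other fields differ) is an assumption you may want to change for your data.

```python
import json


def clean_jsonl_lines(lines):
    """Yield parsed rows, skipping malformed lines, empty prompts, and duplicates."""
    seen = set()
    for line in lines:
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed entry
        if not isinstance(row, dict):
            continue  # not a JSON object
        prompt = row.get("prompt")
        if prompt is None or prompt == []:
            continue  # missing or empty chat-style prompt
        if isinstance(prompt, str) and not prompt.strip():
            continue  # empty or whitespace-only prompt
        # Serialize the prompt so chat-style (list) prompts can be compared too.
        key = json.dumps(prompt, sort_keys=True)
        if key in seen:
            continue  # duplicate prompt
        seen.add(key)
        yield row


raw = ['{"prompt": "a"}', '{"prompt": "a"}', '{"prompt": "  "}', "oops",
       '{"prompt": "b", "label": "1"}']
cleaned = list(clean_jsonl_lines(raw))
print(cleaned)  # [{'prompt': 'a'}, {'prompt': 'b', 'label': '1'}]
```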
