Overview

Veri datasets can come from two places:
  • direct JSONL uploads stored in Veri-managed S3
  • connected external sources that are resolved into JSONL-like rows when a job starts
During GRPO training, the runner materializes rows into prompt records and passes the extra fields through to the reward path.
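
As a rough illustration of that split, the sketch below separates one row into its prompt and the extra fields that travel alongside it. The `split_row` helper is hypothetical and is not part of the Veri SDK; the runner's real internals may differ.

```python
from typing import Any


def split_row(row: dict[str, Any]) -> tuple[Any, dict[str, Any]]:
    """Separate the prompt from the extra fields of one dataset row.

    Hypothetical helper for illustration only, not a Veri API.
    """
    if "prompt" not in row:
        raise ValueError("row is missing the required 'prompt' field")
    # Everything except the prompt is carried along for the reward path.
    extras = {key: value for key, value in row.items() if key != "prompt"}
    return row["prompt"], extras


prompt, extras = split_row({"prompt": "What is 15 * 23?", "label": "345"})
print(prompt)
print(extras)
```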

Supported Sources

  • upload - JSONL file uploaded to Veri
  • s3 - s3://bucket/key.jsonl
  • gs - gs://bucket/key.jsonl
  • az - az://container/blob.jsonl
  • hf - Hugging Face dataset ID plus optional config
  • postgres
  • mysql
  • snowflake
  • bigquery

Upload Format

Each line must be a valid JSON object with at least a prompt field:
{"prompt": "What is 15 * 23?"}
{"prompt": "Solve for x: 2x + 5 = 17"}
{"prompt": "Explain the chain rule in calculus."}
You can also store structured prompts and labels:
{"prompt": [{"role": "user", "content": "What is 15 * 23?"}], "label": "345"}
{"prompt": "Solve for x: 2x + 5 = 17", "expected_answer": "6"}
Extra fields such as label, expected_answer, or metadata stay attached to the row so reward functions can use them.
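
To show why those extra fields are useful, here is a minimal sketch of an exact-match reward that reads `label` or `expected_answer` from the row. The function name and its `(completion, row)` signature are assumptions made for illustration; check the reward function documentation for the interface Veri actually expects.

```python
def exact_match_reward(completion: str, row: dict) -> float:
    """Return 1.0 when the completion matches the row's reference answer.

    Hypothetical signature: Veri's actual reward interface may differ.
    """
    # Prefer "label", fall back to "expected_answer" if it is absent.
    expected = row.get("label", row.get("expected_answer"))
    if expected is None:
        return 0.0  # No reference answer on this row, so nothing to score.
    return 1.0 if completion.strip() == str(expected).strip() else 0.0


print(exact_match_reward(" 345 ", {"prompt": "What is 15 * 23?", "label": "345"}))
```

Exact string matching is deliberately strict; a real reward would usually normalize or parse the answer first.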

Requirements

  • File format: JSONL (one JSON object per line)
  • Required field: prompt
  • Prompt type: string or chat-style message array
  • Encoding: UTF-8
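
These requirements can be checked locally before uploading. The sketch below is one possible validator, not a Veri tool: `validate_jsonl_lines` and `is_valid_prompt` are hypothetical names, and the checks that a string prompt is non-empty and that chat messages carry string `role` and `content` keys are assumptions based on the examples above.

```python
import json


def is_valid_prompt(prompt) -> bool:
    """Accept a non-empty string or a non-empty list of role/content messages."""
    if isinstance(prompt, str):
        return bool(prompt.strip())
    if isinstance(prompt, list) and prompt:
        return all(
            isinstance(message, dict)
            and isinstance(message.get("role"), str)
            and isinstance(message.get("content"), str)
            for message in prompt
        )
    return False


def validate_jsonl_lines(lines):
    """Return a list of (line_number, message) pairs, one per problem found."""
    errors = []
    for number, line in enumerate(lines, start=1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            errors.append((number, "line is not a JSON object"))
        elif "prompt" not in row:
            errors.append((number, "missing required field 'prompt'"))
        elif not is_valid_prompt(row["prompt"]):
            errors.append((number, "prompt must be a string or a message array"))
    return errors


sample = ['{"prompt": "What is 15 * 23?"}', '{"text": "no prompt here"}', "not json"]
for number, message in validate_jsonl_lines(sample):
    print(f"line {number}: {message}")
```

To check a real file, pass an open file object instead of a list, for example `validate_jsonl_lines(open("prompts.jsonl", encoding="utf-8"))`. Opening with `encoding="utf-8"` also surfaces encoding problems as a `UnicodeDecodeError`.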

Preparing Your Dataset

1. Collect prompts

Gather prompts that represent the task you want the model to learn. Quality and diversity of prompts matter more than quantity.

2. Format as JSONL

Convert your prompts to JSONL. Each line must be a valid JSON object:
import json

prompts = [
    "What is 15 * 23?",
    "Solve for x: 2x + 5 = 17",
    "What is the derivative of x^3 + 2x?",
]

# Write one JSON object per line, encoded as UTF-8 as Veri requires.
with open("prompts.jsonl", "w", encoding="utf-8") as f:
    for prompt in prompts:
        f.write(json.dumps({"prompt": prompt}) + "\n")

3. Upload to Veri

import veri

client = veri.Client(api_key="vk_your_api_key")
dataset = client.datasets.upload("prompts.jsonl")

4. Connect an external source

dataset = client.datasets.connect(
    name="gsm8k-train",
    source_type="hf",
    huggingface_dataset="gsm8k",
    huggingface_config={
        "split": "train",
        "column_mapping": {"question": "prompt", "answer": "label"},
    },
)
You can also validate a connection before creating it. This example checks an S3 source:
preview = client.datasets.validate(
    source_type="s3",
    source_uri="s3://my-bucket/prompts.jsonl",
)
print(preview)
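
The `column_mapping` in the Hugging Face example above reads as source column to target field. The sketch below shows what that renaming would do to a single row under that assumption. `apply_column_mapping` is a hypothetical helper, the sample row is made up for illustration, and keeping unmapped columns unchanged is also an assumption rather than documented Veri behavior.

```python
def apply_column_mapping(row: dict, column_mapping: dict) -> dict:
    """Rename source columns to target fields.

    Assumption: columns without a mapping keep their original names.
    """
    return {column_mapping.get(key, key): value for key, value in row.items()}


# Illustrative source row, not real dataset content.
source_row = {"question": "What is 15 * 23?", "answer": "345"}
mapped = apply_column_mapping(source_row, {"question": "prompt", "answer": "label"})
print(mapped)
```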

Tips

  • Diverse prompts: Include a range of difficulty levels and problem types to train a more robust model.
  • Enough data: A few hundred prompts can work, but 1,000+ is generally better for stable training.
  • Clean data: Remove duplicates, empty prompts, and malformed entries before uploading.
  • Map columns explicitly: For external sources, use Hugging Face column mapping or SQL aliases to normalize rows into the fields your reward function expects.
  • Match your task: Prompts should closely reflect the types of queries you expect at inference time.
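
The "Clean data" tip can be sketched as a small preprocessing pass. This is one possible approach, assuming the rows have already been parsed from JSON; `clean_rows` is a hypothetical helper, and treating prompts that differ only in surrounding whitespace as duplicates is a choice made here, not a Veri rule.

```python
import json


def clean_rows(rows):
    """Drop rows with a missing or empty prompt and rows that repeat a prompt.

    The first occurrence of each prompt is kept.
    """
    seen = set()
    cleaned = []
    for row in rows:
        prompt = row.get("prompt")
        if isinstance(prompt, str):
            prompt = prompt.strip()
        if not prompt:
            continue  # Missing, empty, or whitespace-only prompt.
        # Serialize so chat-style message arrays can be compared too.
        key = json.dumps(prompt, sort_keys=True)
        if key in seen:
            continue  # Duplicate of an earlier prompt.
        seen.add(key)
        cleaned.append(row)
    return cleaned


rows = [
    {"prompt": "What is 15 * 23?", "label": "345"},
    {"prompt": "What is 15 * 23? "},
    {"prompt": ""},
]
print(len(clean_rows(rows)))
```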