Datasets - Veri Documentation

Upload Dataset

name (string, required)

A human-readable name for the dataset (sent as a form field alongside the file).

file (required)

A JSONL file where each line is a JSON object containing a prompt field.

Request

curl -X POST https://api.veri.studio/v1/datasets \
  -H "Authorization: Bearer vk_your_api_key" \
  -F "name=math_prompts" \
  -F "file=@math_prompts.jsonl"

Response

{
  "id": "ds_abc123",
  "object": "dataset",
  "name": "math_prompts",
  "source_type": "upload",
  "source_uri": null,
  "huggingface_dataset": null,
  "num_rows": 1500,
  "created_at": "2026-04-14T12:00:00Z"
}
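
Since the upload expects one JSON object per line, each with a prompt field, a small helper can produce the file before it is passed to the curl command above. The following is a minimal sketch using only the Python standard library; the function name write_jsonl is hypothetical and not part of the Veri API.

```python
import io
import json


def write_jsonl(rows, stream):
    """Write one JSON object per line; every row must carry a 'prompt' key."""
    count = 0
    for index, row in enumerate(rows, start=1):
        if not isinstance(row, dict) or "prompt" not in row:
            raise ValueError(f"row {index} is missing a 'prompt' field")
        # json.dumps escapes newlines inside strings, so each row stays on one line.
        stream.write(json.dumps(row) + "\n")
        count += 1
    return count


# An in-memory buffer keeps the example self-contained; a real script would
# pass an open file handle such as open("math_prompts.jsonl", "w").
buffer = io.StringIO()
written = write_jsonl(
    [{"prompt": "What is 15 * 23?"}, {"prompt": "Solve for x: 2x + 5 = 17"}],
    buffer,
)
```

Raising on the first bad row is a design choice for this sketch: it stops a partially valid file from being uploaded.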

Connect Dataset

name (string, required)

Display name for the connected dataset.

source_type (string, required)

One of s3, gs, az, hf, postgres, mysql, snowflake, or bigquery.

source_uri (string)

Required for s3, gs, and az sources.

huggingface_dataset (string)

Required for hf sources.

huggingface_config (object)

Optional Hugging Face config, including split, column_mapping, and token.

db_connection (string)

Required for SQL, Snowflake, and BigQuery-backed sources.

db_query (string)

Required for database-backed sources.

credentials (object)

Optional source credentials for validation. These are used only to test the connection and are not stored. If the source requires authentication, provide them again at training time.

Request

curl -X POST https://api.veri.studio/v1/datasets/connect \
  -H "Authorization: Bearer vk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "gsm8k-train",
    "source_type": "hf",
    "huggingface_dataset": "gsm8k",
    "huggingface_config": {
      "split": "train",
      "column_mapping": {
        "question": "prompt",
        "answer": "label"
      }
    }
  }'

Response

{
  "id": "ds_hf123456789",
  "object": "dataset",
  "name": "gsm8k-train",
  "source_type": "hf",
  "source_uri": null,
  "huggingface_dataset": "gsm8k",
  "num_rows": null,
  "created_at": "2026-04-21T12:00:00Z"
}
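
Because the required fields depend on source_type, it can help to check a payload locally before sending it. The sketch below transcribes the rules from the parameter list above. It assumes that "SQL" and "database-backed" both refer to the postgres, mysql, snowflake, and bigquery source types. The helper names required_fields and build_connect_payload are hypothetical, and the server remains the authority on what is accepted.

```python
# Grouping of source types; treating all four database types alike is an
# assumption drawn from the parameter descriptions, not a documented rule.
OBJECT_STORES = {"s3", "gs", "az"}
DATABASES = {"postgres", "mysql", "snowflake", "bigquery"}


def required_fields(source_type):
    """Return the fields the parameter list marks as required for this source."""
    if source_type in OBJECT_STORES:
        return ["source_uri"]
    if source_type == "hf":
        return ["huggingface_dataset"]
    if source_type in DATABASES:
        return ["db_connection", "db_query"]
    raise ValueError(f"unknown source_type: {source_type!r}")


def build_connect_payload(name, source_type, **fields):
    """Build the JSON body for POST /v1/datasets/connect, failing fast on gaps."""
    missing = [key for key in required_fields(source_type) if not fields.get(key)]
    if missing:
        raise ValueError(f"{source_type} source is missing: {', '.join(missing)}")
    return {"name": name, "source_type": source_type, **fields}


payload = build_connect_payload(
    "gsm8k-train",
    "hf",
    huggingface_dataset="gsm8k",
    huggingface_config={"split": "train"},
)
```

The resulting dictionary can be serialized with json.dumps and sent as the request body shown in the curl example.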

Validate Dataset Connection

This endpoint tests a source and returns preview metadata without creating a dataset row.

Request

curl -X POST https://api.veri.studio/v1/datasets/connect/validate \
  -H "Authorization: Bearer vk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "source_type": "s3",
    "source_uri": "s3://my-bucket/prompts.jsonl"
  }'

Response

{
  "valid": true,
  "num_rows": 1500,
  "columns": ["prompt", "expected_answer"],
  "error": null
}
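
A client might inspect the validation response before creating the dataset. The sketch below is one way to do that, using only the fields shown in the response above. Requiring a prompt column is an assumption (a column_mapping may rename columns later), and check_validation is a hypothetical helper name.

```python
def check_validation(result, required_columns=("prompt",)):
    """Return a list of problems found in a connect/validate response body."""
    problems = []
    if not result.get("valid"):
        # When the source is rejected, the error field carries the reason.
        problems.append(f"source rejected: {result.get('error')}")
        return problems
    columns = result.get("columns") or []
    for column in required_columns:
        if column not in columns:
            problems.append(f"missing column: {column}")
    if result.get("num_rows") == 0:
        problems.append("source is empty")
    return problems


ok = check_validation(
    {"valid": True, "num_rows": 1500, "columns": ["prompt", "expected_answer"], "error": None}
)
unmapped = check_validation(
    {"valid": True, "num_rows": 10, "columns": ["question", "answer"], "error": None}
)
```

Here ok is an empty list, while unmapped reports the missing prompt column, which suggests that a column mapping is still needed.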

List Datasets

limit (integer, default: 20)

Maximum number of datasets to return.

after (string)

Cursor for pagination. Pass the ID of the last item from the previous page.

Request

curl "https://api.veri.studio/v1/datasets?limit=10" \
  -H "Authorization: Bearer vk_your_api_key"

Response

{
  "object": "list",
  "data": [
    {
      "id": "ds_abc123",
      "object": "dataset",
      "name": "math_prompts",
      "source_type": "upload",
      "source_uri": null,
      "huggingface_dataset": null,
      "num_rows": 1500,
      "created_at": "2026-04-14T12:00:00Z"
    },
    {
      "id": "ds_def456",
      "object": "dataset",
      "name": "gsm8k-train",
      "source_type": "hf",
      "source_uri": null,
      "huggingface_dataset": "gsm8k",
      "num_rows": null,
      "created_at": "2026-04-21T12:00:00Z"
    }
  ],
  "has_more": false
}
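
The after cursor and the has_more flag can be combined into a loop that walks every page. The sketch below keeps the HTTP call behind a caller-supplied fetch_page function so the paging logic can be read and tested on its own. The names iter_datasets, fetch_page, and fake_fetch_page, and the ID ds_ghi789, are hypothetical.

```python
def iter_datasets(fetch_page, limit=20):
    """Yield every dataset, following the `after` cursor until has_more is false."""
    after = None
    while True:
        page = fetch_page(limit=limit, after=after)
        items = page["data"]
        yield from items
        # Stop on the last page, or on an empty page to avoid looping forever.
        if not page.get("has_more") or not items:
            return
        after = items[-1]["id"]


def fake_fetch_page(limit, after):
    """Stand-in for GET /v1/datasets that serves three datasets from memory."""
    ids = ["ds_abc123", "ds_def456", "ds_ghi789"]
    start = ids.index(after) + 1 if after else 0
    chunk = ids[start:start + limit]
    return {
        "object": "list",
        "data": [{"id": dataset_id, "object": "dataset"} for dataset_id in chunk],
        "has_more": start + limit < len(ids),
    }


all_ids = [dataset["id"] for dataset in iter_datasets(fake_fetch_page, limit=2)]
```

In a real client, fetch_page would issue the GET request shown above with the limit and after query parameters and return the parsed JSON body.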

Get Dataset

Request

curl https://api.veri.studio/v1/datasets/ds_abc123 \
  -H "Authorization: Bearer vk_your_api_key"

Response

{
  "id": "ds_abc123",
  "object": "dataset",
  "name": "math_prompts",
  "source_type": "upload",
  "source_uri": null,
  "huggingface_dataset": null,
  "num_rows": 1500,
  "created_at": "2026-04-14T12:00:00Z"
}

Dataset Format

Each uploaded JSONL row should be a JSON object with at least a prompt field:

{"prompt": "What is 15 * 23?"}
{"prompt": "Solve for x: 2x + 5 = 17"}
{"prompt": "Explain the chain rule in calculus."}

The prompt value can be either a string or a list of chat messages. Additional fields stay available to reward functions and data resolvers.
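
The format rules above can be checked locally before an upload. The following is a rough sketch, not the server's actual validation. It assumes blank lines may be skipped and checks only that prompt is a string or a list, not the structure of individual chat messages. The function name find_invalid_rows is hypothetical.

```python
import json


def find_invalid_rows(jsonl_text):
    """Return (line_number, reason) pairs for rows that break the format rules."""
    problems = []
    for number, line in enumerate(jsonl_text.split("\n"), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored rather than rejected
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((number, "row is not a JSON object"))
        elif "prompt" not in row:
            problems.append((number, "missing prompt field"))
        elif not isinstance(row["prompt"], (str, list)):
            problems.append((number, "prompt must be a string or a list"))
    return problems


sample = (
    '{"prompt": "What is 15 * 23?", "expected_answer": "345"}\n'
    '{"question": "no prompt here"}\n'
    '{"prompt": 42}\n'
    "not json\n"
)
problems = find_invalid_rows(sample)
```

For the sample text, the first row passes (extra fields such as expected_answer are allowed), and rows 2, 3, and 4 are reported.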

Connected datasets are resolved at job start, so validation is the best way to catch auth or schema issues early.
