Upload Dataset

name (string, required): A human-readable name for the dataset, sent as a form field alongside the file.
file (file, required): A JSONL file where each line is a JSON object containing a prompt field.

Request

curl -X POST https://api.veri.studio/v1/datasets \
  -H "Authorization: Bearer vk_your_api_key" \
  -F "name=math_prompts" \
  -F "file=@math_prompts.jsonl"

Response

{
  "id": "ds_abc123",
  "object": "dataset",
  "name": "math_prompts",
  "source_type": "upload",
  "source_uri": null,
  "huggingface_dataset": null,
  "num_rows": 1500,
  "created_at": "2026-04-14T12:00:00Z"
}
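
For callers without curl, the same multipart request can be sketched in Python with only the standard library. The endpoint, the Authorization header, and the name and file form fields come from the curl example above. The helper names build_multipart and upload_dataset are hypothetical and not part of any published client, and the application/octet-stream content type for the file part is an assumption.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(name: str, filename: str, content: bytes) -> tuple[bytes, str]:
    """Encode the `name` and `file` form fields as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = [
        (
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="name"\r\n\r\n'
            f"{name}\r\n"
        ).encode(),
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            # Assumption: the API accepts a generic binary content type.
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode(),
        content,
        f"\r\n--{boundary}--\r\n".encode(),
    ]
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def upload_dataset(api_key: str, name: str, path: str) -> dict:
    """POST a JSONL file to /v1/datasets and return the parsed JSON response."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(name, os.path.basename(path), fh.read())
    request = urllib.request.Request(
        "https://api.veri.studio/v1/datasets",
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping the body construction separate from the network call makes the encoding easy to inspect or unit-test without contacting the API.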

Connect Dataset

name (string, required): Display name for the connected dataset.
source_type (string, required): One of s3, gs, az, hf, postgres, mysql, snowflake, or bigquery.
source_uri (string): Required for s3, gs, and az sources.
huggingface_dataset (string): Required for hf sources.
huggingface_config (object): Optional Hugging Face config, including split, column_mapping, and token.
db_connection (string): Required for postgres, mysql, snowflake, and bigquery sources.
db_query (string): Required for postgres, mysql, snowflake, and bigquery sources.
credentials (object): Optional source credentials for validation. These are used to test the connection but are not stored, so you will need to provide them again at training time if the source requires authentication.

Request

curl -X POST https://api.veri.studio/v1/datasets/connect \
  -H "Authorization: Bearer vk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "gsm8k-train",
    "source_type": "hf",
    "huggingface_dataset": "gsm8k",
    "huggingface_config": {
      "split": "train",
      "column_mapping": {
        "question": "prompt",
        "answer": "label"
      }
    }
  }'

Response

{
  "id": "ds_hf123456789",
  "object": "dataset",
  "name": "gsm8k-train",
  "source_type": "hf",
  "source_uri": null,
  "huggingface_dataset": "gsm8k",
  "num_rows": null,
  "created_at": "2026-04-21T12:00:00Z"
}
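
Which fields are required depends on source_type, so a small client-side check can catch an incomplete payload before it is sent. The sketch below transcribes the rules from the parameter descriptions above. The function name missing_connect_fields is hypothetical, and treating "SQL" sources as postgres and mysql is an assumption based on the list of allowed source_type values.

```python
# Groupings transcribed from the Connect Dataset parameter descriptions.
# Assumption: "SQL" sources means postgres and mysql.
OBJECT_STORES = {"s3", "gs", "az"}
DATABASES = {"postgres", "mysql", "snowflake", "bigquery"}


def missing_connect_fields(payload: dict) -> list[str]:
    """Return the names of required fields that are absent or empty."""
    required = ["name", "source_type"]
    source_type = payload.get("source_type")
    if source_type in OBJECT_STORES:
        required.append("source_uri")
    elif source_type == "hf":
        required.append("huggingface_dataset")
    elif source_type in DATABASES:
        required += ["db_connection", "db_query"]
    return [field for field in required if not payload.get(field)]


payload = {"name": "gsm8k-train", "source_type": "hf", "huggingface_dataset": "gsm8k"}
print(missing_connect_fields(payload))  # []
```

The server remains the source of truth. This check only saves a round trip for the most common omissions.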

Validate Dataset Connection

This endpoint tests a source and returns preview metadata (row count and column names) without creating a dataset.

Request

curl -X POST https://api.veri.studio/v1/datasets/connect/validate \
  -H "Authorization: Bearer vk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "source_type": "s3",
    "source_uri": "s3://my-bucket/prompts.jsonl"
  }'

Response

{
  "valid": true,
  "num_rows": 1500,
  "columns": ["prompt", "expected_answer"],
  "error": null
}
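
A caller will usually want to turn this response into a go/no-go decision. The sketch below reads only the fields shown in the example response. The function name check_validation is hypothetical, and requiring a prompt column for connected sources is an assumption carried over from the upload format.

```python
def check_validation(result: dict, required_columns=("prompt",)) -> list[str]:
    """Return human-readable problems found in a validate response."""
    if not result.get("valid"):
        # When the source is rejected, report the server's error and stop.
        return [f"source rejected: {result.get('error')}"]
    columns = result.get("columns") or []
    # Assumption: connected datasets need the same prompt column as uploads.
    return [f"missing column: {name}" for name in required_columns if name not in columns]


example = {"valid": True, "num_rows": 1500, "columns": ["prompt", "expected_answer"], "error": None}
print(check_validation(example))  # []
```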

List Datasets

limit (integer, default: 20): Maximum number of datasets to return.
after (string): Cursor for pagination. Pass the ID of the last item from the previous page.

Request

curl "https://api.veri.studio/v1/datasets?limit=10" \
  -H "Authorization: Bearer vk_your_api_key"

Response

{
  "object": "list",
  "data": [
    {
      "id": "ds_abc123",
      "object": "dataset",
      "name": "math_prompts",
      "source_type": "upload",
      "source_uri": null,
      "huggingface_dataset": null,
      "num_rows": 1500,
      "created_at": "2026-04-14T12:00:00Z"
    },
    {
      "id": "ds_def456",
      "object": "dataset",
      "name": "gsm8k-train",
      "source_type": "hf",
      "source_uri": null,
      "huggingface_dataset": "gsm8k",
      "num_rows": null,
      "created_at": "2026-04-21T12:00:00Z"
    }
  ],
  "has_more": false
}
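
To walk every page, pass the ID of the last item as after until has_more is false. A minimal sketch of that loop follows. The fetch_page argument is a hypothetical caller-supplied function that performs GET /v1/datasets with the given query parameters and returns the parsed JSON, so the loop itself makes no assumptions about the HTTP client.

```python
from typing import Callable, Iterator


def iter_datasets(fetch_page: Callable[[dict], dict], limit: int = 20) -> Iterator[dict]:
    """Yield every dataset by following the `after` cursor across pages."""
    params = {"limit": limit}
    while True:
        page = fetch_page(params)
        items = page["data"]
        yield from items
        # Stop on the last page, or on an empty page to avoid looping forever.
        if not page.get("has_more") or not items:
            return
        # The cursor is the ID of the last item on the page just received.
        params = {"limit": limit, "after": items[-1]["id"]}
```

Because the function is a generator, callers can stop early without fetching the remaining pages.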

Get Dataset

Request

curl https://api.veri.studio/v1/datasets/ds_abc123 \
  -H "Authorization: Bearer vk_your_api_key"

Response

{
  "id": "ds_abc123",
  "object": "dataset",
  "name": "math_prompts",
  "source_type": "upload",
  "source_uri": null,
  "huggingface_dataset": null,
  "num_rows": 1500,
  "created_at": "2026-04-14T12:00:00Z"
}

Dataset Format

Each uploaded JSONL row should be a JSON object with at least a prompt field:
{"prompt": "What is 15 * 23?"}
{"prompt": "Solve for x: 2x + 5 = 17"}
{"prompt": "Explain the chain rule in calculus."}
The prompt value can be either a string or a list of chat messages. Additional fields stay available to reward functions and data resolvers.
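
A file can be checked against these rules locally before uploading. The sketch below reports every line that is not a JSON object with a prompt that is a string or a list. The function name check_jsonl is hypothetical, skipping blank lines is an assumption about how the server treats them, and the contents of a chat-message list are not inspected because the message schema is not specified here.

```python
import json


def check_jsonl(lines) -> list[str]:
    """Return one message per line that lacks a usable prompt field."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # Assumption: blank lines are ignored rather than rejected.
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            problems.append(f"line {number}: not a JSON object")
        elif not isinstance(row.get("prompt"), (str, list)):
            problems.append(f"line {number}: prompt must be a string or a list")
    return problems
```

The function accepts any iterable of strings, so an open file handle can be passed directly and large files are read line by line.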
Connected datasets are resolved at job start, so calling the validate endpoint when you connect a source is the best way to catch authentication or schema issues before a job runs.