Evaluations

Evaluations score a deployed model and give you both an aggregate score and per-sample results, so you can see exactly which inputs a model gets wrong — and compare checkpoints before promoting one. There are two ways to provide test data:

Your own dataset — a JSONL dataset you’ve uploaded, scored with scorers like exact_match and contains.
A curated benchmark — GSM8K, HumanEval, and friends. Benchmarks bring their own data, prompting, and scoring; you only pick a model.
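
To make the two scorer names above concrete, here is a minimal sketch of what scorers called exact_match and contains typically compute. The whitespace trimming and the 1.0/0.0 return values are assumptions for illustration; the platform's actual normalization rules are documented in the Scorers reference, not here.

```python
def exact_match(output: str, label: str) -> float:
    """Return 1.0 when output equals label after trimming whitespace (assumed rule)."""
    return 1.0 if output.strip() == label.strip() else 0.0


def contains(output: str, label: str) -> float:
    """Return 1.0 when a non-empty label appears anywhere in the output."""
    needle = label.strip()
    return 1.0 if needle and needle in output else 0.0


print(exact_match(" 4 ", "4"))                     # trimmed strings are equal
print(exact_match("The answer is 4", "4"))         # extra words break an exact match
print(contains("The capital is Paris.", "Paris"))  # substring match still succeeds
```

The contrast between the last two calls is the usual reason to pick contains for free-form answers and exact_match for short, constrained ones.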

How it works

Create an eval — a reusable definition: a dataset + scorers, or a benchmark task.
Start a run — the eval × a model target (deployment:<id>, a serving deployment).
The eval engine fans samples out concurrently (up to 100 in flight) against your deployment’s serving replicas. Code benchmarks execute each completion in an isolated sandbox.
Watch progress live; read the aggregate score and drill into per-sample items when it completes.
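
The concurrent fan-out in the steps above can be pictured with a bounded-concurrency sketch. This is not the engine's code: run_samples, fake_infer, and the result keys are hypothetical names, and only the cap of 100 in-flight samples comes from the text.

```python
import asyncio

MAX_IN_FLIGHT = 100  # the cap quoted in the steps above


async def run_samples(samples, infer, max_in_flight=MAX_IN_FLIGHT):
    """Send every sample through infer, never exceeding max_in_flight at once."""
    gate = asyncio.Semaphore(max_in_flight)

    async def one(sample):
        async with gate:  # blocks while max_in_flight samples are already running
            try:
                return {"input": sample, "output": await infer(sample), "error": None}
            except Exception as exc:  # record the failure instead of aborting the run
                return {"input": sample, "output": None, "error": str(exc)}

    # gather returns results in the same order as the input samples
    return await asyncio.gather(*(one(s) for s in samples))


async def fake_infer(prompt):
    """Hypothetical stand-in for a request to the deployment."""
    await asyncio.sleep(0.001)
    return prompt.upper()


results = asyncio.run(run_samples(["a", "b", "c"], fake_infer, max_in_flight=2))
print([r["output"] for r in results])
```

Catching exceptions per sample mirrors the idea that one failed request becomes an errored item rather than a failed run.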

Runs are resumable: progress is checkpointed per item, so a run interrupted by a platform deploy picks up where it left off instead of starting over.
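
As an illustration of per-item checkpointing, the sketch below appends one record per scored item to a local file and skips already-recorded items on restart. The file layout and the run_with_checkpoint helper are assumptions made for the example; how the platform actually stores checkpoints is not described here.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoint(samples, score_fn, checkpoint_path):
    """Score samples, skipping any whose index is already in the checkpoint file."""
    path = Path(checkpoint_path)
    done = {}
    if path.exists():
        for line in path.read_text().splitlines():
            record = json.loads(line)
            done[record["index"]] = record["score"]

    with path.open("a") as fh:
        for index, sample in enumerate(samples):
            if index in done:
                continue  # finished before the interruption
            score = score_fn(sample)
            fh.write(json.dumps({"index": index, "score": score}) + "\n")
            fh.flush()  # persist each item as soon as it is scored
            done[index] = score
    return [done[i] for i in range(len(samples))]


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "run.ckpt.jsonl"
    # Simulate an earlier, interrupted run that had finished item 0.
    ckpt.write_text(json.dumps({"index": 0, "score": 1.0}) + "\n")
    calls = []

    def score(sample):
        calls.append(sample)
        return float(len(sample))

    scores = run_with_checkpoint(["a", "bb", "ccc"], score, ckpt)
    print(scores, calls)
```

Only the two unfinished samples reach the scoring function on the resumed pass, which is the behavior "picks up where it left off" describes.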

Quickstart (CLI)

Run GSM8K against a serving deployment — no dataset, no config file:

veri evals benchmarks                       # see what's available
veri evals create --benchmark gsm8k --model deployment:dep_abc123 --n 100 --follow

--follow streams progress and prints the score when the run completes:

Created eval_9f2c01 and started run run_5b7d22 against deployment:dep_abc123.
  [   1.2s] running  0/100
  [   9.8s] running  64/100
  [  14.1s] completed  100/100
  gsm8k: mean=0.8100  n=100  errors=0

To eval your own dataset instead:

veri evals create --quick \
  --dataset ds_abc123 \
  --model deployment:dep_abc123 \
  --scorer exact_match \
  --n 200 --follow

Or keep the definition in a TOML config (veri run configs/eval.toml works too):

kind = "eval"

[eval]
name = "gsm8k-baseline"

[[scorers]]
type = "benchmark"
name = "gsm8k"
task = "gsm8k"
sample_limit = 100

[run]
model = "deployment:dep_abc123"

Quickstart (dashboard)

On the Evaluations page:

Click + Create → Run a benchmark (or pick a dataset).
Choose the benchmark and a sample limit, then Create.
In the detail panel, pick a serving deployment and click Run eval.
Progress updates live; scores appear in the runs table when done.

Quickstart (SDK)

from veri_sdk import Client

client = Client()

ev = client.evals.create(
    name="gsm8k-baseline",
    scorers=[{"type": "benchmark", "name": "gsm8k", "task": "gsm8k", "sample_limit": 100}],
)
run = client.evals.runs.create(eval_id=ev.id, model="deployment:dep_abc123")
run = run.wait()                     # poll until terminal
print(run.results["scores"])         # {"gsm8k": {"mean": 0.81, "count": 100, ...}}

items = client.evals.runs.items(ev.id, run.id)   # per-sample inspection

Dataset format

Dataset evals expect JSONL rows with a prompt field and a reference field (named label by default):

{"prompt": "What is 2+2?", "label": "4"}
{"prompt": [{"role": "user", "content": "Capital of France?"}], "label": "Paris"}

prompt is either a plain string or a chat-message array; label is what scorers compare against.
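
Before uploading, it can help to check rows locally against the shape described above. The validator below is a sketch under that description only: validate_row is a hypothetical helper, and the requirement that each chat message carries role and content keys is inferred from the example row rather than from a published schema.

```python
import json


def validate_row(line: str, label_field: str = "label") -> dict:
    """Parse one JSONL line and check the prompt and reference fields."""
    row = json.loads(line)
    if not isinstance(row, dict):
        raise ValueError("each line must be a JSON object")

    prompt = row.get("prompt")
    is_text = isinstance(prompt, str)
    is_chat = (
        isinstance(prompt, list)
        and len(prompt) > 0
        and all(isinstance(m, dict) and "role" in m and "content" in m for m in prompt)
    )
    if not (is_text or is_chat):
        raise ValueError("prompt must be a string or a non-empty list of chat messages")
    if label_field not in row:
        raise ValueError(f"missing reference field {label_field!r}")
    return row


rows = [
    '{"prompt": "What is 2+2?", "label": "4"}',
    '{"prompt": [{"role": "user", "content": "Capital of France?"}], "label": "Paris"}',
]
labels = [validate_row(r)["label"] for r in rows]
print(labels)
```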

Results

A completed run carries per-scorer aggregates:

Field	Meaning
`mean`, `median`, `std`, `min`, `max`	Aggregate stats over scored samples
`count`	Samples that produced a score
`error_count`	Samples that errored (inference or scoring failure) — excluded from the stats

Per-sample rows (items) include the input, the model’s output, the reference label, each scorer’s score, and inference latency.
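
The relationship between per-sample items and the aggregate fields can be sketched as follows. The item keys score and error are hypothetical, and the use of the population standard deviation for std is an assumption; the source does not say which variant the platform reports.

```python
import math
import statistics


def aggregate(items):
    """Compute aggregate stats over scored items; errored items are only counted."""
    scores = [it["score"] for it in items if it.get("error") is None]
    error_count = len(items) - len(scores)
    if not scores:
        return {"count": 0, "error_count": error_count}
    return {
        "mean": statistics.fmean(scores),
        "median": statistics.median(scores),
        "std": statistics.pstdev(scores),  # population std (assumed)
        "min": min(scores),
        "max": max(scores),
        "count": len(scores),
        "error_count": error_count,
    }


items = [
    {"score": 1.0, "error": None},
    {"score": 0.0, "error": None},
    {"score": 1.0, "error": None},
    {"score": 1.0, "error": None},
    {"score": None, "error": "inference timeout"},  # excluded from the stats
]
summary = aggregate(items)
print(summary)
```

Here the mean is 0.75 over four scored samples, and the errored sample shows up only in error_count, so a run with many errors can report a high mean over a small count.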

Model targets

Runs target deployment:<id>, a deployment in the serving state. Inference cost is attributed through the deployment’s normal request accounting; there is no separate eval billing. Targeting a scaled_to_zero deployment triggers its wake, but the run fails immediately with a hint to retry, because the engine doesn’t hold eval samples open through a multi-minute GPU cold start. Retry the run once the deployment is back to serving, or pre-warm it first with veri deployments wake <id> (see scale to zero).
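
One way to script the retry described above is to wait for the serving state before starting the run. The sketch below is deliberately decoupled from the SDK: get_status and start_run are hypothetical callables you would wire to your own status check and run creation, and the timeout and polling interval are arbitrary defaults.

```python
import time


def run_when_serving(get_status, start_run, timeout_s=600.0, poll_s=5.0, sleep=time.sleep):
    """Poll get_status() until it returns "serving", then call start_run()."""
    waited = 0.0
    while get_status() != "serving":
        if waited >= timeout_s:
            raise TimeoutError("deployment did not reach serving in time")
        sleep(poll_s)
        waited += poll_s
    return start_run()


# Simulated deployment that needs two polls to finish its cold start.
statuses = iter(["scaled_to_zero", "scaled_to_zero", "serving"])
naps = []  # records requested sleeps instead of actually waiting
result = run_when_serving(lambda: next(statuses), lambda: "run started", sleep=naps.append)
print(result, naps)
```

Injecting sleep keeps the example instant to run; in real use the default time.sleep applies.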

Next steps

Benchmarks — the curated catalog and how code execution is sandboxed
Scorers — configuration reference for every scorer type
CLI reference — every veri evals command and flag
