Work in progress. Evaluations are on the post-beta roadmap — this page describes the planned shape of the feature. The API and SDK surface below is illustrative and may change before launch. For early access or feedback, contact us.
Overview
Veri Evaluations will let you score a trained model, or a live deployment, against a fixed dataset and a reward function (or judge model), so you can compare runs and catch regressions before promoting a checkpoint. Today, the same reward functions you use for GRPO training double as evaluation primitives. Evaluations make that loop standalone:
- Pick a model or deployment.
- Pick a dataset of prompts (and optional references).
- Pick one or more reward functions or judges.
- Get aggregate metrics + per-example traces.
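Since the Evaluations API is not final, the loop above can only be sketched in plain Python. The example below is not the Veri SDK: the `evaluate` function, the `prompt`/`reference` dataset fields, and the scorer signature `(prompt, completion, reference) -> float` are all assumptions made for illustration.

```python
from statistics import mean
from typing import Callable, Optional

# Assumed shapes (not the Veri SDK): a model is a callable prompt -> completion,
# and a scorer is a callable (prompt, completion, reference) -> float.
Generate = Callable[[str], str]
Scorer = Callable[[str, str, Optional[str]], float]


def evaluate(generate: Generate, dataset: list[dict], scorers: dict[str, Scorer]) -> dict:
    """Score every example with every scorer; return aggregates and traces."""
    traces = []
    for example in dataset:
        prompt = example["prompt"]
        reference = example.get("reference")  # references are optional
        completion = generate(prompt)
        scores = {name: fn(prompt, completion, reference) for name, fn in scorers.items()}
        traces.append(
            {"prompt": prompt, "completion": completion, "reference": reference, "scores": scores}
        )

    # Aggregate metric: the mean of each scorer over all examples.
    aggregate = {}
    if traces:
        aggregate = {name: mean(t["scores"][name] for t in traces) for name in scorers}
    return {"aggregate": aggregate, "traces": traces}
```

Keeping the per-example traces next to the aggregates makes it possible to see which prompts moved a score, not only that the score moved.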
Planned workflow
Pick what you're evaluating
A trained checkpoint, a deployment endpoint, or a base model (for baselines).
Pick a dataset
Reuse any Veri dataset, or upload an eval-only set of prompts and (optional) reference outputs.
Pick scorers
Reuse your existing reward functions, or attach a judge model (LLM-as-judge) for open-ended tasks.
What’s coming first
The first release will focus on the surface most teams already need:
- Reward-function evals: score generations with the same Python reward function used for GRPO.
- Deterministic metrics: exact-match, regex, JSON-validity, function-calling correctness.
- Run comparisons: side-by-side scores across checkpoints from the same training job.
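To make the deterministic metrics and run comparisons concrete, here is a rough sketch using only the Python standard library. The function names, the whitespace-stripping rule for exact match, and the `compare_runs` output shape are assumptions, not the planned Veri implementation. Function-calling correctness is left out because it depends on a call format this page does not specify.

```python
import json
import re
from statistics import mean


def exact_match(completion: str, reference: str) -> float:
    """1.0 if the completion equals the reference, ignoring outer whitespace."""
    return 1.0 if completion.strip() == reference.strip() else 0.0


def regex_match(completion: str, pattern: str) -> float:
    """1.0 if the pattern matches anywhere in the completion."""
    return 1.0 if re.search(pattern, completion) else 0.0


def json_valid(completion: str) -> float:
    """1.0 if the completion parses as JSON."""
    try:
        json.loads(completion)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return 0.0
    return 1.0


def compare_runs(scores_by_checkpoint: dict[str, list[float]]) -> list[dict]:
    """Mean score per checkpoint, with the delta against the first checkpoint."""
    rows = []
    baseline = None
    for name, scores in scores_by_checkpoint.items():
        avg = mean(scores)
        if baseline is None:
            baseline = avg  # the first checkpoint listed is the comparison baseline
        rows.append({"checkpoint": name, "mean": avg, "delta": avg - baseline})
    return rows
```

Because these scorers are pure functions of the text, the same generation always receives the same score, which keeps differences between checkpoints attributable to the model and not to the metric.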
Coming after that
- LLM-as-judge with rubric prompts.
- Eval-only datasets (no training labels required).
- Scheduled evals against live deployments for drift detection.
- Public eval leaderboards for shared benchmarks.
In the meantime
You can already approximate evaluation today by:
- Submitting a small training job with max_steps: 0 and your eval reward function. The reward signal at step 0 is your baseline score.
- Running your reward function locally against generations from a deployment endpoint.
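The local approach might look like the sketch below, which scores generations you have already collected from a deployment and saved as JSON Lines, then compares the result with a baseline score such as the step-0 reward. The `prompt` and `completion` field names, the reward function signature `(prompt, completion) -> float`, and both helper names are assumptions for illustration. How you fetch generations from the endpoint is left out on purpose.

```python
import json
from statistics import mean
from typing import Callable, Iterable


def score_saved_generations(
    lines: Iterable[str], reward_fn: Callable[[str, str], float]
) -> float:
    """Mean reward over JSON Lines records with 'prompt' and 'completion' keys."""
    scores = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        scores.append(reward_fn(record["prompt"], record["completion"]))
    return mean(scores)  # raises statistics.StatisticsError if there are no records


def is_regression(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """True if the candidate score falls below the baseline by more than tolerance."""
    return candidate < baseline - tolerance
```

A small tolerance helps avoid flagging a regression that is only sampling noise, though the right value depends on the dataset size and on how variable the reward function is.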