Quickstart

Train a math reasoning model using GRPO — the same technique behind DeepSeek R1
In this quickstart, you’ll fine-tune Qwen 2.5 7B on the GSM8K dataset to produce a model that reasons step-by-step through grade school math problems. Veri handles GPU provisioning, job orchestration, and checkpoint delivery — you just provide the dataset and reward function.
Want the full script instead? Grab quickstart.py from the examples repo — a single self-contained Python file that walks through the entire flow with detailed comments.

Step 1: Get your API key

Create an API key from the Veri dashboard and export it:
export VERI_API_KEY="vk_your_api_key"

Step 2: Connect the GSM8K dataset

GSM8K is OpenAI’s dataset of about 8,500 grade school math problems with step-by-step solutions (7,473 in the train split and 1,319 in the test split). It’s a widely used benchmark for math reasoning. Veri can connect directly to Hugging Face, so no download is needed:
from veri_sdk import Client

client = Client()

dataset = client.datasets.connect(
    name="gsm8k-train",
    source_type="hf",
    huggingface_dataset="openai/gsm8k",
    huggingface_config={
        "config_name": "main",
        "split": "train",
        "column_mapping": {
            "question": "prompt",
            "answer": "label",
        },
    },
)
print(f"Dataset ID: {dataset.id}")
The column_mapping tells Veri that GSM8K’s question field is the prompt and answer is the label your reward function can check against.
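To make the mapping concrete, here is a minimal sketch of the renaming it describes, applied to the first GSM8K training record. The helper `apply_column_mapping` is hypothetical and only illustrates the idea; how Veri applies the mapping internally is not specified here.

```python
# Hypothetical illustration of a column mapping; not part of the Veri SDK.
def apply_column_mapping(record, column_mapping):
    """Return a copy of `record` with source columns renamed to target names."""
    return {column_mapping.get(key, key): value for key, value in record.items()}


# First record of the GSM8K "main" train split.
record = {
    "question": (
        "Natalia sold clips to 48 of her friends in April, and then she sold "
        "half as many clips in May. How many clips did Natalia sell altogether "
        "in April and May?"
    ),
    "answer": (
        "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
        "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
        "#### 72"
    ),
}

mapped = apply_column_mapping(record, {"question": "prompt", "answer": "label"})
print(sorted(mapped))                    # ['label', 'prompt']
print(mapped["label"].splitlines()[-1])  # #### 72
```

The label keeps GSM8K's full worked solution. The final answer follows the `####` marker, which is why the reward function in Step 3 extracts the number after it.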

Step 3: Write a reward function

GRPO works by generating multiple completions per prompt, scoring each one, and reinforcing the better responses. Your reward function defines what “better” means. For math reasoning, we want to reward two things:
  1. Correct answers — did the model get the right number?
  2. Structured reasoning — did it show its work in a clear format?
Save this as reward.py:
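import re

# A number: optional sign, digits with optional thousands separators, optional decimals.
# The lookbehind keeps the "-" in "100-28" from being read as a minus sign.
NUMBER = r"(?<!\d)-?\d[\d,]*(?:\.\d+)?"


def to_number(text):
    """Parse a string such as "1,234.5" into a float; return None if it is not numeric."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None


def reward(completions, answer=None, **kwargs):
    """Score completions on correctness and reasoning format.

    Rewards (additive):
      +1.0  correct final answer
      +0.5  structured <reasoning>...</reasoning><answer>...</answer> format
      +0.25 any numeric answer attempted (even if wrong)
    """
    scores = []
    for i, completion in enumerate(completions):
        score = 0.0

        # Check for structured reasoning format
        has_reasoning = bool(re.search(
            r"<reasoning>.*?</reasoning>", completion, re.DOTALL
        ))
        has_answer_tag = bool(re.search(
            r"<answer>.*?</answer>", completion, re.DOTALL
        ))
        if has_reasoning and has_answer_tag:
            score += 0.5

        # Extract the numeric answer, preferring the <answer> tag
        answer_match = re.search(rf"<answer>\s*({NUMBER})\s*</answer>", completion)
        if answer_match:
            predicted = to_number(answer_match.group(1))
        else:
            # Fallback: grab the last number in the completion
            numbers = re.findall(NUMBER, completion)
            predicted = to_number(numbers[-1]) if numbers else None

        if predicted is not None:
            score += 0.25  # attempted a numeric answer

            # TRL-style reward functions usually receive one label per completion
            # as a list; a single shared label is handled as well.
            label = answer[i] if isinstance(answer, (list, tuple)) else answer
            if label is not None:
                # GSM8K answers end with "#### 42": extract the number
                ref_match = re.search(rf"####\s*({NUMBER})", str(label))
                ref = to_number(ref_match.group(1) if ref_match else str(label).strip())
                # Compare numerically so "72", "72.0", and "72." all match
                if ref is not None and predicted == ref:
                    score += 1.0

        scores.append(score)
    return scores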
This reward function gives partial credit, so the model can earn reward for structure and for attempting an answer before it reliably gets the answer right. The components are additive:
| Behavior | Reward |
| --- | --- |
| Empty response, or no number and no tags | 0.0 |
| Attempted a numeric answer | +0.25 |
| Used <reasoning> + <answer> format | +0.5 |
| Correct final answer | +1.0 |
| Perfect (correct + structured) | 1.75 total |
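Before uploading, it can help to run the reward function locally on a few hand-written completions. The sketch below is one way to do that. `smoke_test` and `toy_reward` are hypothetical names, and `toy_reward` is only a stand-in so the block runs on its own; in practice you would pass the `reward` function from reward.py.

```python
# Hypothetical local check for a TRL-style reward function; not part of the Veri SDK.
def smoke_test(reward_fn, completions, answer, max_score=1.75):
    """Run reward_fn once and verify the basic contract of its return value."""
    scores = reward_fn(completions, answer=answer)
    if len(scores) != len(completions):
        raise ValueError("reward function must return one score per completion")
    for score in scores:
        if not 0.0 <= score <= max_score:
            raise ValueError(f"score {score} is outside [0, {max_score}]")
    return scores


def toy_reward(completions, answer=None, **kwargs):
    """Stand-in reward: 1.0 if the reference number appears in the completion."""
    ref = str(answer).split("####")[-1].strip()
    return [1.0 if ref in completion else 0.0 for completion in completions]


completions = [
    "<reasoning>48 / 2 = 24, and 48 + 24 = 72</reasoning><answer>72</answer>",
    "I am not sure.",
]
scores = smoke_test(toy_reward, completions, answer="#### 72")
print(scores)  # [1.0, 0.0]
```

To check the quickstart's function instead, replace `toy_reward` with `from reward import reward` and pass `reward` to `smoke_test`.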
Upload reward.py:
reward_fn = client.reward_functions.upload("reward.py", format="trl")
print(f"Reward ID: {reward_fn.id}")

Step 4: Launch training

Now submit the GRPO training job. Veri automatically provisions the GPU, loads the dataset, and starts training.
job = client.training_jobs.create(
    base_model="Qwen/Qwen2.5-7B-Instruct",
    dataset_id=dataset.id,
    reward_function_id=reward_fn.id,
    output_name="qwen2.5-7b-math-reasoning",
    hyperparameters={
        "learning_rate": 1e-6,
        "rollouts_per_prompt": 8,
        "max_response_length": 2048,
        "kl_coef": 0.001,
    },
    gpu={"type": "A100-80GB", "count": 1},
)

print(f"Job ID: {job.id}")
print(f"Status: {job.status}")

What happens during training

For each prompt in GSM8K, GRPO:
  1. Generates 8 completions from the model
  2. Scores each with your reward function
  3. Computes relative advantages — completions that scored above the group mean get reinforced
  4. Updates the model weights while staying close to the original (controlled by kl_coef)
Over time, the model learns to produce structured reasoning that arrives at correct answers.
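The relative-advantage step can be sketched in a few lines. This assumes the common GRPO formulation, in which each reward is centered on its group mean and divided by the group standard deviation. Veri's exact implementation is not described here, and `group_advantages` is a hypothetical helper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize one prompt's rewards into group-relative advantages."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example rewards for 8 completions of a single prompt, using the scoring from Step 3.
rewards = [1.75, 1.75, 0.75, 0.25, 0.25, 1.25, 0.0, 0.75]
advantages = group_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:<5} advantage={a:+.2f}")
```

Completions above the group mean get a positive advantage and those below get a negative one. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason partial credit is useful early in training.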

Step 5: Monitor and download

# Wait for completion (polls automatically)
job.wait()
print(f"Status: {job.status}")

# Download the fine-tuned model
if job.download_url:
    print(f"Checkpoint: {job.download_url}")
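
`job.wait()` hides the polling loop. If you want a timeout or custom logging, a generic loop looks roughly like the sketch below. `poll_until_done`, the status strings, and the status-fetching callable are all assumptions for illustration; check the SDK reference for the real status values and how to refresh a job.

```python
import time


def poll_until_done(get_status, terminal=("completed", "failed", "cancelled"),
                    interval=30.0, timeout=6 * 3600, sleep=time.sleep):
    """Call get_status() until it returns a terminal status or the timeout passes.

    The status names in `terminal` are assumed, not taken from the Veri API.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status in terminal:
            return status
        if waited >= timeout:
            raise TimeoutError(f"still '{status}' after {waited:.0f}s")
        sleep(interval)
        waited += interval


# Demo with a fake job that finishes on the third check (no real waiting).
fake_statuses = iter(["queued", "running", "completed"])
final = poll_until_done(lambda: next(fake_statuses), sleep=lambda seconds: None)
print(final)  # completed
```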

Step 6: Use your model

The output is a standard Hugging Face checkpoint. Load it directly:
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./qwen2.5-7b-math-reasoning")
tokenizer = AutoTokenizer.from_pretrained("./qwen2.5-7b-math-reasoning")

prompt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
After training, you should see structured reasoning like:
<reasoning>
Natalia sold 48 clips in April.
In May, she sold half as many: 48 / 2 = 24 clips.
Total clips sold: 48 + 24 = 72
</reasoning>
<answer>72</answer>
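
Because the answer is wrapped in tags, it is easy to pull out programmatically. Here is a minimal sketch, assuming the model follows the format shown above; `extract_answer` is a hypothetical helper.

```python
import re


def extract_answer(text):
    """Return the number inside <answer>...</answer> as a float, or None."""
    match = re.search(r"<answer>\s*(-?\d[\d,]*(?:\.\d+)?)\s*</answer>", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


output = """<reasoning>
Natalia sold 48 clips in April.
In May, she sold half as many: 48 / 2 = 24 clips.
Total clips sold: 48 + 24 = 72
</reasoning>
<answer>72</answer>"""

print(extract_answer(output))              # 72.0
print(extract_answer("no tags here: 72"))  # None
```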

What to try next

Customize the reward function

The reward function above is a starting point. You can add more signals:
  • Penalize overly long responses (-0.5 if over 500 tokens)
  • Reward step-by-step arithmetic (+0.25 per intermediate calculation)
  • Check unit consistency or mathematical notation
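
As one example, the length penalty from the first bullet could be layered on top of existing scores. This sketch approximates token count by whitespace-separated words, since the reward function receives text rather than token IDs. `apply_length_penalty` and its defaults are hypothetical.

```python
def apply_length_penalty(scores, completions, max_words=500, penalty=0.5):
    """Subtract `penalty` from the score of every completion over `max_words` words.

    Word count is a rough stand-in for token count (an assumption of this sketch).
    """
    return [
        score - penalty if len(completion.split()) > max_words else score
        for score, completion in zip(scores, completions)
    ]


completions = ["short answer: 72", "word " * 600]
print(apply_length_penalty([1.25, 1.25], completions))  # [1.25, 0.75]
```

Inside `reward`, this would be called on `scores` just before the final `return`.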

Try different models

Start small for fast iteration, scale up for quality:
| Model | Best for |
| --- | --- |
| Qwen/Qwen2.5-1.5B-Instruct | Fast experimentation, validating reward functions |
| Qwen/Qwen2.5-7B-Instruct | Good balance of quality and cost |
| meta-llama/Llama-3.1-8B-Instruct | Strong general reasoning baseline |

Try video generation

Veri also supports LoRA SFT for video generation models like CogVideoX, Wan2.1, and LTX Video. See Video Gen Fine-Tuning.

Next steps

  • Reward Functions: Design better reward signals
  • GRPO Training: Understand GRPO hyperparameters
  • API Reference: Complete API documentation
  • Python SDK: Full SDK reference
  • Model Catalog: Browse 95+ supported models
  • Deployment: Serve your fine-tuned model