What is GRPO?

Group Relative Policy Optimization (GRPO) is the fine-tuning method Veri uses when training against reward signals. Unlike supervised fine-tuning (SFT), which learns from fixed examples, GRPO lets the model explore different responses and learn from the rewards they earn, which reinforces the behaviors you care about. GRPO was introduced by the DeepSeek team and is the algorithm behind models like DeepSeek-R1. It is a policy-optimization variant that does not need a separate critic (value) model, which makes it more memory-efficient than PPO.

How GRPO Works

1. Generate completions

For each prompt in the dataset, the model generates multiple completions. In Veri’s current schema this is controlled by rollouts_per_prompt.

2. Score with reward function

Your reward function scores every completion. This produces a set of scores for each prompt.

3. Compute relative advantages

Instead of using an absolute baseline (like a value network in PPO), GRPO computes advantages relative to the group. Completions that scored above the group mean get positive advantage; those below get negative advantage.

4. Update the policy

The model’s weights are updated to increase the probability of high-advantage completions and decrease the probability of low-advantage ones, subject to a KL divergence constraint against the reference model.
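
Steps 2 and 3 can be sketched in a few lines. This is a minimal illustration, not Veri's implementation: it assumes the mean-and-standard-deviation normalization from the original GRPO formulation, and the helper name group_relative_advantages, the use of the population standard deviation, and the eps value are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Return one advantage per completion for a single prompt's group.

    Assumption: advantages are (reward - group mean) / (group std + eps).
    The eps term only guards against division by zero when all rewards match.
    """
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # population std over the group (assumed)
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Four completions for one prompt, scored by a hypothetical reward function.
rewards = [0.0, 1.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)

# Completions above the group mean get a positive advantage, those below a
# negative one. If every completion scores the same, all advantages are zero.
uniform = group_relative_advantages([0.5, 0.5, 0.5])
```

Because the baseline is just the group's own mean, no value network has to be trained or held in memory to produce these numbers.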

Why GRPO Over PPO?

| | GRPO | PPO |
| --- | --- | --- |
| Critic model | Not needed | Required (extra memory) |
| Memory efficiency | Higher | Lower |
| Baseline | Group-relative | Learned value function |
| Stability | Comparable | Comparable |
| Implementation | Simpler | More complex |

GRPO achieves comparable results to PPO while using significantly less GPU memory, since it does not need to maintain a separate value model.

Hyperparameters

When creating a training job, you can configure the following hyperparameters:

| Parameter | Default | Description |
| --- | --- | --- |
| learning_rate | 1e-6 | Learning rate for the optimizer. Lower values are more stable but slower. |
| num_epochs | 1 | Number of passes through the dataset. |
| max_steps | null | Optional explicit step cap. |
| rollouts_per_prompt | 8 | Number of completions generated per prompt. Higher values improve group comparisons but use more compute. |
| kl_coef | 0.001 | KL penalty coefficient against the reference model. |
| max_prompt_length | 1024 | Prompt token cap. |
| max_response_length | 2048 | Completion token cap. |
| global_batch_size | 64 | Total batch size across the job. |
| seed | 42 | Random seed. |
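
For reference, the defaults above can be written out as a plain Python mapping. The parameter names and default values come from the table; the dict shape and the with_overrides helper are illustrative assumptions, not a documented Veri client API.

```python
# Defaults copied from the hyperparameter table. The container shape is an
# assumption for illustration; check the API reference for the real schema.
DEFAULT_HYPERPARAMETERS = {
    "learning_rate": 1e-6,
    "num_epochs": 1,
    "max_steps": None,  # shown as null in the table
    "rollouts_per_prompt": 8,
    "kl_coef": 0.001,
    "max_prompt_length": 1024,
    "max_response_length": 2048,
    "global_batch_size": 64,
    "seed": 42,
}


def with_overrides(**overrides):
    """Return the defaults with selected values replaced (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULT_HYPERPARAMETERS)
    if unknown:
        raise ValueError(f"unknown hyperparameters: {sorted(unknown)}")
    return {**DEFAULT_HYPERPARAMETERS, **overrides}


# Example: a more memory-conservative configuration.
hyperparameters = with_overrides(rollouts_per_prompt=4, max_response_length=1024)
```

Rejecting unknown keys locally catches typos such as a misspelled parameter name before a job is submitted.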

Tuning Tips

  • learning_rate: Start with 1e-6 for most models. If training is unstable (the reward oscillates), lower it. If the reward barely moves, try a slightly higher value.
  • rollouts_per_prompt: More rollouts per prompt give a better estimate of the group advantage, but cost proportionally more compute. Start with 8 unless you are constrained by memory or time.
  • max_prompt_length and max_response_length: These two settings are usually the first levers to pull when you hit out-of-memory (OOM) conditions. Recent end-to-end testing hit an A100 40GB memory ceiling, so conservative response lengths are still the safer default.
  • num_epochs: 1 to 3 epochs is typical for reward-based fine-tuning. More epochs risk overfitting to the reward function.
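
To see why the length caps and the rollout count matter, it helps to work out the worst-case number of tokens a single prompt can generate. The sketch below is simple arithmetic over the documented defaults; it gives an upper bound on token counts, not a memory estimate, since actual memory use also depends on the model, precision, and batching. The function name is hypothetical.

```python
def worst_case_tokens_per_prompt(
    rollouts_per_prompt=8,
    max_prompt_length=1024,
    max_response_length=2048,
):
    """Upper bound on tokens processed for one prompt's group of rollouts."""
    tokens_per_sequence = max_prompt_length + max_response_length
    return rollouts_per_prompt * tokens_per_sequence


# Documented defaults: 8 * (1024 + 2048) = 24576 tokens per prompt.
default_budget = worst_case_tokens_per_prompt()

# Halving the response cap: 8 * (1024 + 1024) = 16384 tokens per prompt.
shorter_responses = worst_case_tokens_per_prompt(max_response_length=1024)

# Halving the rollouts instead: 4 * (1024 + 2048) = 12288 tokens per prompt.
fewer_rollouts = worst_case_tokens_per_prompt(rollouts_per_prompt=4)
```

Fewer rollouts shrink the budget fastest, but they also weaken the group comparison, which is why trimming the length caps is usually tried first.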

Infrastructure

Veri routes GRPO jobs to NVIDIA GPUs across multiple compute providers:
  • Modal — Serverless GPU functions with per-second billing.
  • Lambda Cloud — On-demand GPU instances.
  • RunPod — Pod-based GPU compute.
  • **** — VM-based GPU instances.
Available GPU types include A100 (80GB), H100 (80GB), L40S, and others depending on the provider. GPU type and count must be specified explicitly in the gpu object on job creation. You can specify a provider or omit it to let Veri auto-select the cheapest available option. The request can also choose a checkpoint_destination, which lets the final checkpoint land in Veri-managed storage or your own S3, GCS, or Azure location.
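
The pieces described above could be assembled into a request body along these lines. This is a sketch under stated assumptions: the source names the gpu object, an optional provider, and checkpoint_destination, but the inner keys ("type", "count"), the placement of provider at the top level, and the string values used here are guesses for illustration only. Consult the API reference for the actual schema.

```python
def build_job_request(gpu_type, gpu_count, provider=None, checkpoint_destination=None):
    """Build a hypothetical job-creation body; field shapes are assumed."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    # GPU type and count are always explicit, as the docs require.
    request = {"gpu": {"type": gpu_type, "count": gpu_count}}
    # Leaving provider out is how this sketch models auto-selection.
    if provider is not None:
        request["provider"] = provider
    if checkpoint_destination is not None:
        request["checkpoint_destination"] = checkpoint_destination
    return request


# Provider omitted, so Veri would pick the cheapest available option.
auto_selected = build_job_request("H100", 1)

# Provider pinned explicitly (the value format is an assumption).
pinned = build_job_request("A100", 2, provider="modal")
```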

Monitoring

During training, you can:
  • Stream logs via the GET /v1/training_jobs/{id}/logs SSE endpoint to see real-time loss and reward metrics.
  • View Grafana dashboards linked from your Veri dashboard for detailed training curves, GPU utilization, and memory usage.
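
Since the logs endpoint uses server-sent events (SSE), a client needs to split the stream into events. The sketch below is a minimal parser for the data field of the standard SSE wire format; it is not a Veri SDK function, and the sample lines are made-up placeholders because the source does not describe the log payload format.

```python
def iter_sse_data(lines):
    """Yield the data payload of each server-sent event from text lines."""
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data_lines:
                yield "\n".join(data_lines)
                data_lines = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips one optional leading space.
            if value.startswith(" "):
                value = value[1:]
            data_lines.append(value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.


# Placeholder input; real payloads come from the logs endpoint.
sample_stream = [
    "data: example line 1\n",
    "\n",
    ": keep-alive\n",
    "data: example line 2\n",
    "\n",
]
events = list(iter_sse_data(sample_stream))
```

In practice the lines would come from a streaming HTTP response to GET /v1/training_jobs/{id}/logs. The base URL and authentication scheme are not covered here, so they are left out of the sketch.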

Output

When training completes, Veri produces a model checkpoint in HuggingFace format. This includes:
  • Model weights (pytorch_model.bin or safetensors)
  • Tokenizer files
  • Model configuration
The checkpoint can be loaded directly with HuggingFace Transformers, for example via AutoModelForCausalLM.from_pretrained(), for inference or further fine-tuning.
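
A minimal loading sketch, assuming the checkpoint has been downloaded to a local directory and the transformers package is installed. The load_checkpoint helper and the example path are hypothetical; from_pretrained is the standard Transformers entry point.

```python
def load_checkpoint(checkpoint_dir):
    """Load model and tokenizer from a HuggingFace-format checkpoint directory."""
    # Imported lazily so this sketch can be defined without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
    model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
    return model, tokenizer


# Example usage (hypothetical local path):
# model, tokenizer = load_checkpoint("./my-veri-checkpoint")
```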