Train a math reasoning model using GRPO — the same technique behind DeepSeek R1
In this quickstart, you’ll fine-tune Qwen 2.5 7B on the GSM8K dataset to produce a model that reasons step-by-step through grade school math problems. Veri handles GPU provisioning, job orchestration, and checkpoint delivery — you just provide the dataset and reward function.
Want the full script instead? Grab quickstart.py from the examples repo — a single self-contained Python file that walks through the entire flow with detailed comments.
GSM8K is OpenAI’s dataset of roughly 8,800 grade school math problems with step-by-step solutions. It is a widely used benchmark for math reasoning models.
Veri can connect directly to Hugging Face, so no download is needed.
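For orientation, it helps to know what a GSM8K record looks like before writing a reward function. The sketch below uses a hand-typed record in the dataset's shape (a `question` field and an `answer` field whose last line is `#### <number>`) rather than downloading anything. The helper name `final_answer` is made up for this example and is not part of Veri or Hugging Face.

```python
import re

# Hand-typed record in the shape GSM8K uses; not loaded from the dataset.
record = {
    "question": (
        "Natalia sold clips to 48 of her friends in April, and then she sold "
        "half as many clips in May. How many clips did Natalia sell "
        "altogether in April and May?"
    ),
    "answer": (
        "Natalia sold 48/2 = 24 clips in May.\n"
        "Natalia sold 48+24 = 72 clips altogether in April and May.\n"
        "#### 72"
    ),
}


def final_answer(solution):
    """Return the number after the '####' marker, or None if it is absent."""
    match = re.search(r"####\s*(\d[\d,]*(?:\.\d+)?)", solution)
    # Drop thousands separators so "1,250" becomes "1250".
    return match.group(1).replace(",", "") if match else None


print(final_answer(record["answer"]))  # prints 72
```

The worked solution before the marker is free text, so only the value after `####` is a reliable label to score against.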
GRPO works by generating multiple completions per prompt, scoring each one, and reinforcing the better responses. Your reward function defines what “better” means.
For math reasoning, we want to reward two things:
Correct answers — did the model get the right number?
Structured reasoning — did it show its work in a clear format?
Save this as reward.py:
import re

# A number: digits with optional thousands separators and an optional
# decimal part. Unlike [\d,.]+, this never matches a lone "." or ",",
# and it does not swallow a sentence-ending period after the number.
NUMBER = r"\d[\d,]*(?:\.\d+)?"


def reward(completions, answer=None, **kwargs):
    """Score completions on correctness and reasoning format.

    Rewards:
      +1.0   correct final answer
      +0.5   structured <reasoning>...</reasoning><answer>...</answer> format
      +0.25  any numeric answer attempted (even if wrong)
    """
    scores = []
    for completion in completions:
        score = 0.0

        # Check for structured reasoning format
        has_reasoning = bool(re.search(
            r"<reasoning>.*?</reasoning>", completion, re.DOTALL
        ))
        has_answer_tag = bool(re.search(
            r"<answer>.*?</answer>", completion, re.DOTALL
        ))
        if has_reasoning and has_answer_tag:
            score += 0.5

        # Extract the numeric answer
        answer_match = re.search(rf"<answer>\s*({NUMBER})\s*</answer>", completion)
        if answer_match:
            predicted = answer_match.group(1).replace(",", "")
        else:
            # Fallback: grab the last number in the completion
            numbers = re.findall(NUMBER, completion)
            predicted = numbers[-1].replace(",", "") if numbers else None

        if predicted is not None:
            score += 0.25  # attempted a numeric answer

            # Check correctness against the label
            if answer is not None:
                # GSM8K answers have the format "#### 42": extract the number
                ref_match = re.search(rf"####\s*({NUMBER})", str(answer))
                ref = ref_match.group(1).replace(",", "") if ref_match else str(answer).strip()
                if predicted == ref:
                    score += 1.0

        scores.append(score)
    return scores
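This reward function gives partial credit, so the model can pick up the structured format first and then improve on correctness. The scoring breakdown:
Correct final answer: +1.0
Structured <reasoning>...</reasoning><answer>...</answer> format: +0.5
Any numeric answer attempted, even if wrong: +0.25
A completion that is both correct and well formatted scores the maximum of 1.75.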
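As a rough illustration of how scores like these feed into GRPO, the sketch below takes four hypothetical completions for one prompt, with rewards of 1.75 (correct and well formatted), 0.75 (well formatted but wrong), 0.25 (a bare wrong number), and 0.0 (no number at all), and converts them to group-relative advantages by subtracting the group mean and dividing by the group standard deviation. That normalization follows the published GRPO formulation. Whether Veri computes advantages exactly this way is an assumption, as are the function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` value.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center one prompt's rewards on the group mean and scale by its spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    # eps avoids division by zero when every completion scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of the same prompt.
group = [1.75, 0.75, 0.25, 0.0]
advantages = group_relative_advantages(group)

for r, a in zip(group, advantages):
    print(f"reward={r:.2f}  advantage={a:+.2f}")
```

Completions that score above the group average get a positive advantage and are reinforced, while those below it are discouraged. If every completion in a group receives the same reward, all advantages come out as zero and the group carries no learning signal, which is one reason partial credit is useful early in training.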
The output is a standard Hugging Face checkpoint. Load it directly:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned weights and tokenizer from the checkpoint directory
model = AutoModelForCausalLM.from_pretrained("./qwen2.5-7b-math-reasoning")
tokenizer = AutoTokenizer.from_pretrained("./qwen2.5-7b-math-reasoning")

prompt = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether "
    "in April and May?"
)

# Tokenize the prompt, generate up to 512 new tokens, and decode the result
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
After training, you should see structured reasoning like:
<reasoning>
Natalia sold 48 clips in April.
In May, she sold half as many: 48 / 2 = 24 clips.
Total clips sold: 48 + 24 = 72
</reasoning>
<answer>72</answer>
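If you want to use this output programmatically, a small parser is usually enough. The following is a minimal sketch that assumes the model keeps to the tag convention rewarded during training. The function name `parse_completion` is hypothetical and not part of Veri or transformers.

```python
import re


def parse_completion(text):
    """Split a completion into its reasoning text and final answer.

    Returns (reasoning, answer); either value is None if its tag is missing.
    """
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        reasoning.group(1).strip() if reasoning else None,
        answer.group(1).strip() if answer else None,
    )


sample = (
    "<reasoning>\n"
    "Natalia sold 48 clips in April.\n"
    "In May, she sold half as many: 48 / 2 = 24 clips.\n"
    "Total clips sold: 48 + 24 = 72\n"
    "</reasoning>\n"
    "<answer>72</answer>"
)

reasoning, answer = parse_completion(sample)
print(answer)  # prints 72
```

Because the parser searches for the tags anywhere in the string, it also works on decoded output that still contains the prompt in front of the completion.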