Overview
After training a model with Veri, you can deploy it as an OpenAI-compatible inference endpoint — or download the checkpoint and serve it yourself.
Veri provides two deployment paths:
- Hosted deployment: deploy directly from a completed training job. Veri provisions the GPU, serves the model, and gives you an OpenAI-compatible endpoint with per-request logging and metrics.
- Self-hosted: download the checkpoint and serve it in your own stack (vLLM, SGLang, Transformers, etc.).
Hosted Deployment
Create a deployment
Deploy a completed training job as an inference endpoint:
from veri_sdk import Client

client = Client()
deployment = client.deployments.create(
    model="JOB_ID",  # training job ID
    source="training_job",
    name="my-math-model",
    gpu={"gpu_type": "H100", "gpu_count": 1},
)
print(f"Deployment ID: {deployment.id}")
print(f"Status: {deployment.status}")

curl -X POST https://api.veri.studio/v1/deployments \
  -H "Authorization: Bearer $VERI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "JOB_ID",
    "source": "training_job",
    "name": "my-math-model",
    "gpu": {
      "gpu_type": "H100",
      "gpu_count": 1
    }
  }'
You can also deploy Hugging Face models directly:
curl -X POST https://api.veri.studio/v1/deployments \
  -H "Authorization: Bearer $VERI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "source": "huggingface",
    "name": "qwen3-base",
    "gpu": {"gpu_type": "A100-80GB", "gpu_count": 1}
  }'
Deployment lifecycle
- Queued: Deployment created, waiting for GPU provisioning.
- Provisioning: GPU instance is starting and the model is loading.
- Serving: Endpoint is live and accepting inference requests.
- Stopped: Deployment was stopped by the user. GPU released, billing stopped.
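A deployment has to reach the serving state before it accepts requests, so a client typically waits for it. The sketch below is one way to do that, not part of the Veri SDK: `wait_until_serving` is a hypothetical helper, and `get_status` is a caller-supplied function that returns the current status string (this page does not show how to read a deployment's status, so that part is left abstract). Treating `failed` and `stopped` as reasons to give up is an assumption.

```python
import time
from typing import Callable

# Status names come from this page; treating these two as "give up" states
# is an assumption of this sketch.
TERMINAL_FAILURES = {"failed", "stopped"}


def wait_until_serving(
    get_status: Callable[[], str],
    timeout_s: float = 900.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until get_status() returns "serving".

    Raises RuntimeError on "failed" or "stopped", and TimeoutError if the
    deadline passes while the deployment is still in another state.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "serving":
            return
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"deployment ended in status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} s")
        sleep(interval_s)
```

Passing `sleep` as a parameter keeps the loop easy to exercise in tests without real delays. In practice `get_status` would wrap whatever call returns the deployment's current `status` field.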
Send inference requests
Once the deployment is serving, send OpenAI-compatible chat completion requests:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.veri.studio/v1/deployments/DEPLOYMENT_ID",
    api_key="vk_your_api_key",
)
response = client.chat.completions.create(
    model="my-math-model",
    messages=[
        {"role": "user", "content": "What is 15% of 240?"}
    ],
)
print(response.choices[0].message.content)

curl -X POST https://api.veri.studio/v1/deployments/DEPLOYMENT_ID/chat/completions \
  -H "Authorization: Bearer $VERI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-math-model",
    "messages": [
      {"role": "user", "content": "What is 15% of 240?"}
    ]
  }'
Monitor your deployment
View request history and aggregate metrics:
# Request history
curl "https://api.veri.studio/v1/deployments/DEPLOYMENT_ID/requests?limit=20" \
-H "Authorization: Bearer $VERI_API_KEY"
# Aggregate metrics
curl "https://api.veri.studio/v1/deployments/DEPLOYMENT_ID/metrics" \
-H "Authorization: Bearer $VERI_API_KEY"
Metrics include total requests, average latency, token usage, error rate, uptime, and total cost.
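The same two monitoring calls can be made from Python with only the standard library. This is a minimal sketch: `build_get` and `fetch_json` are hypothetical helper names, the paths mirror the curl commands above, and the assumption that responses are JSON is mine, since this page does not document the response schema.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.veri.studio/v1"


def build_get(path: str, api_key: str, **params) -> urllib.request.Request:
    """Build an authenticated GET request for a Veri API path."""
    url = f"{API_BASE}{path}"
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )


def fetch_json(request: urllib.request.Request):
    """Send the request and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)
```

For example, `fetch_json(build_get("/deployments/DEPLOYMENT_ID/requests", api_key, limit=20))` corresponds to the request history call, and the same helper with the `/metrics` path corresponds to the aggregate metrics call.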
Stop a deployment
curl -X POST https://api.veri.studio/v1/deployments/DEPLOYMENT_ID/stop \
-H "Authorization: Bearer $VERI_API_KEY"
Stopping a deployment releases the GPU and stops billing. The deployment and its request history remain viewable after it is stopped.
Billing
Deployments are billed per hour based on GPU type and count, and credits are deducted hourly while the deployment is running. To check your burn rate across all active jobs and deployments:
curl "https://api.veri.studio/v1/billing/burn_rate" \
-H "Authorization: Bearer $VERI_API_KEY"
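To get a rough sense of what per-hour billing means for a single deployment, the arithmetic can be sketched as rate × GPU count × hours. The rate below is a made-up number, not a Veri price, and rounding partial hours up is an assumption: this page does not say how partial hours are charged, so the sketch lets you turn rounding off.

```python
import math


def estimate_credits(
    rate_per_gpu_hour: float,
    gpu_count: int,
    hours_running: float,
    round_up_hours: bool = True,
) -> float:
    """Estimate credits used by one deployment.

    round_up_hours=True bills any started hour as a full hour (an
    assumption); False pro-rates partial hours instead.
    """
    hours = math.ceil(hours_running) if round_up_hours else hours_running
    return rate_per_gpu_hour * gpu_count * hours


# Hypothetical rate of 3.0 credits per GPU-hour, 2 GPUs, 5.5 hours of uptime.
rounded = estimate_credits(3.0, 2, 5.5)
prorated = estimate_credits(3.0, 2, 5.5, round_up_hours=False)
```

With these illustrative inputs the two estimates differ by one GPU-pair hour, which is why it is worth confirming the actual rounding behavior against the burn rate endpoint.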
Self-Hosted Deployment
If you prefer to serve models on your own infrastructure, download the checkpoint from a completed training job.
Checkpoint destinations
| Type | What happens |
|---|---|
| veri | Veri stores the checkpoint and returns a download URL. |
| s3 | Veri writes the checkpoint to your S3 path. |
| gs | Veri writes the checkpoint to your Google Cloud Storage path. |
| az | Veri writes the checkpoint to your Azure Blob path. |
Download the checkpoint
curl https://api.veri.studio/v1/training_jobs/JOB_ID/model \
-H "Authorization: Bearer $VERI_API_KEY"
Recommended serving stacks
| Stack | Best for |
|---|---|
| vLLM | Production NVIDIA serving, multi-LoRA support |
| SGLang | High-throughput serving, multi-turn workloads |
| Transformers | Direct loading, custom pipelines |
Example: serve with vLLM
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model ./my-checkpoint \
--host 0.0.0.0 \
--port 8000
Your model is now served at http://localhost:8000/v1/chat/completions through vLLM's OpenAI-compatible API.
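To check the local server, you can send it the same chat request used for the hosted endpoint. The sketch below uses only the standard library; `build_chat_request` and `ask` are hypothetical helper names. It assumes the server was started with the command above, in which case vLLM uses the `--model` value (`./my-checkpoint`) as the model name unless `--served-model-name` is set.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "./my-checkpoint",  # matches --model in the command above
    base_url: str = "http://localhost:8000/v1",
) -> urllib.request.Request:
    """Build a POST request for the local chat completions route."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Calling `ask("What is 15% of 240?")` requires the vLLM server to be running; `build_chat_request` on its own only constructs the request, which makes it easy to inspect the payload first.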
Deployment statuses
| Status | Description |
|---|---|
| queued | Deployment created, waiting for GPU |
| provisioning | GPU starting, model loading |
| serving | Endpoint live, accepting requests |
| unhealthy | Health check failed (auto-recovers) |
| stopped | Stopped by user |
| failed | Deployment failed; check error field |