Overview

After training a model with Veri, you can deploy it as an OpenAI-compatible inference endpoint — or download the checkpoint and serve it yourself. Veri provides two deployment paths:
  1. Hosted deployment: deploy directly from a completed training job. Veri provisions the GPU, serves the model, and gives you an OpenAI-compatible endpoint with per-request logging and metrics.
  2. Self-hosted: download the checkpoint and serve it in your own stack (vLLM, SGLang, Transformers, etc.).

Hosted Deployment

Create a deployment

Deploy a completed training job as an inference endpoint:
from veri_sdk import Client

client = Client()

deployment = client.deployments.create(
    model="JOB_ID",           # training job ID
    source="training_job",
    name="my-math-model",
    gpu={"gpu_type": "H100", "gpu_count": 1},
)
print(f"Deployment ID: {deployment.id}")
print(f"Status: {deployment.status}")
You can also deploy Hugging Face models directly:
curl -X POST https://api.veri.studio/v1/deployments \
  -H "Authorization: Bearer $VERI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "source": "huggingface",
    "name": "qwen3-base",
    "gpu": {"gpu_type": "A100-80GB", "gpu_count": 1}
  }'

Deployment lifecycle

  1. Queued: Deployment created, waiting for GPU provisioning.
  2. Provisioning: GPU instance is starting and the model is loading.
  3. Serving: Endpoint is live and accepting inference requests.
  4. Stopped: Deployment was stopped by the user. GPU released, billing stopped.
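A client usually needs to wait for the Serving state before sending traffic. The sketch below shows one way to poll for it. It is not part of the Veri SDK: `wait_until_serving` and its parameters are hypothetical names, and the status lookup is passed in as a callable because this guide does not show how to fetch a deployment's current status. The status strings come from the Deployment statuses table in this guide; treating "unhealthy" as transient is an assumption based on its "auto-recovers" note.

```python
import time
from typing import Callable

# Statuses after which "serving" is not expected without user action.
TERMINAL_FAILURES = {"failed", "stopped"}


def wait_until_serving(
    fetch_status: Callable[[], str],
    timeout_s: float = 900.0,
    poll_interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until it returns "serving".

    Raises RuntimeError on "failed" or "stopped", and TimeoutError if the
    deployment has not reached "serving" within timeout_s seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status == "serving":
            return status
        if status in TERMINAL_FAILURES:
            raise RuntimeError(f"deployment ended in status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} seconds")
        # "queued", "provisioning", and "unhealthy" are treated as transient.
        sleep(poll_interval_s)


if __name__ == "__main__":
    # Stand-in for a real status lookup; replace it with your own call.
    fake_statuses = iter(["queued", "provisioning", "serving"])
    print(wait_until_serving(lambda: next(fake_statuses), sleep=lambda _: None))
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any particular client and makes it easy to exercise without a live deployment.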

Send inference requests

Once the deployment is serving, send OpenAI-compatible chat completion requests:
from openai import OpenAI

client = OpenAI(
    base_url="https://api.veri.studio/v1/deployments/DEPLOYMENT_ID",
    api_key="vk_your_api_key",
)

response = client.chat.completions.create(
    model="my-math-model",
    messages=[
        {"role": "user", "content": "What is 15% of 240?"}
    ],
)
print(response.choices[0].message.content)

Monitor your deployment

View request history and aggregate metrics:
# Request history
curl "https://api.veri.studio/v1/deployments/DEPLOYMENT_ID/requests?limit=20" \
  -H "Authorization: Bearer $VERI_API_KEY"

# Aggregate metrics
curl "https://api.veri.studio/v1/deployments/DEPLOYMENT_ID/metrics" \
  -H "Authorization: Bearer $VERI_API_KEY"
Metrics include total requests, average latency, token usage, error rate, uptime, and total cost.
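The same metrics call can be made from Python. This is a minimal sketch using only the standard library. It assumes the endpoint returns a JSON object, and it prints the body as-is because the exact field names are not listed in this guide. `build_metrics_request` and `fetch_metrics` are illustrative helper names, not SDK functions.

```python
import json
import urllib.request

BASE_URL = "https://api.veri.studio/v1"


def build_metrics_request(deployment_id: str, api_key: str) -> urllib.request.Request:
    """Build the GET request for a deployment's aggregate metrics."""
    return urllib.request.Request(
        f"{BASE_URL}/deployments/{deployment_id}/metrics",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def fetch_metrics(deployment_id: str, api_key: str, timeout_s: float = 30.0) -> dict:
    """Send the request and decode the JSON body (assumed to be an object)."""
    request = build_metrics_request(deployment_id, api_key)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)


# Usage (needs network access, a real deployment ID, and `import os`):
#   metrics = fetch_metrics("DEPLOYMENT_ID", os.environ["VERI_API_KEY"])
#   print(json.dumps(metrics, indent=2))
```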

Stop a deployment

curl -X POST https://api.veri.studio/v1/deployments/DEPLOYMENT_ID/stop \
  -H "Authorization: Bearer $VERI_API_KEY"
Stopping a deployment releases the GPU and stops billing. You can still view the deployment and its request history after it has stopped.

Billing

Deployments are billed per hour based on GPU type and GPU count. Credits are deducted hourly while the deployment is running. Check your burn rate across all active jobs and deployments:
curl "https://api.veri.studio/v1/billing/burn_rate" \
  -H "Authorization: Bearer $VERI_API_KEY"
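For a rough budget before launching, the hourly billing model can be turned into a back-of-the-envelope estimate. The sketch below rests on two assumptions that this guide does not confirm: cost scales linearly with GPU count, and every started hour is deducted in full. The rate in the example is a made-up placeholder, not a Veri price.

```python
import math


def estimate_credits(hourly_rate_per_gpu: float, gpu_count: int, runtime_hours: float) -> float:
    """Estimate credits used by one deployment.

    Assumes linear scaling with GPU count and full deduction for every
    started hour. Both rules are assumptions, not documented behavior.
    """
    if hourly_rate_per_gpu < 0 or gpu_count < 1 or runtime_hours < 0:
        raise ValueError("rate and runtime must be non-negative, gpu_count at least 1")
    billed_hours = math.ceil(runtime_hours)
    return hourly_rate_per_gpu * gpu_count * billed_hours


if __name__ == "__main__":
    # Hypothetical rate of 3.0 credits per GPU-hour, 2 GPUs, 5.5 hours of runtime:
    # 6 billed hours * 2 GPUs * 3.0 = 36.0 credits.
    print(estimate_credits(3.0, 2, 5.5))
```

Compare the estimate with the burn rate endpoint once the deployment is running, since that reflects what is actually being charged.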

Self-Hosted Deployment

If you prefer to serve models in your own infrastructure, download the checkpoint from a completed training job.

Checkpoint destinations

| Type | What happens |
| --- | --- |
| veri | Veri stores the checkpoint and returns a download URL. |
| s3 | Veri writes the checkpoint to your S3 path. |
| gs | Veri writes the checkpoint to your Google Cloud Storage path. |
| az | Veri writes the checkpoint to your Azure Blob path. |

Download the checkpoint

curl https://api.veri.studio/v1/training_jobs/JOB_ID/model \
  -H "Authorization: Bearer $VERI_API_KEY"

| Stack | Best for |
| --- | --- |
| vLLM | Production NVIDIA serving, multi-LoRA support |
| SGLang | High-throughput serving, multi-turn workloads |
| Transformers | Direct loading, custom pipelines |

Example: serve with vLLM

pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model ./my-checkpoint \
  --host 0.0.0.0 \
  --port 8000
Your model is now served at http://localhost:8000/v1/chat/completions through vLLM's OpenAI-compatible API.
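To check the local server, you can send it a chat completion request. The sketch below uses only the Python standard library and assumes the server was started with the command above on localhost:8000. vLLM uses the `--model` value as the served model name unless `--served-model-name` is set, so the request passes `./my-checkpoint`. `build_chat_request` and `ask` are illustrative helper names.

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_request(prompt: str, model: str = "./my-checkpoint") -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the local server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout_s: float = 60.0) -> str:
    """Send the prompt and return the text of the first choice."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (needs the vLLM server from the command above to be running):
#   print(ask("What is 15% of 240?"))
```

The OpenAI Python client shown earlier in this guide should work as well if you point `base_url` at http://localhost:8000/v1.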

Deployment statuses

| Status | Description |
| --- | --- |
| queued | Deployment created, waiting for GPU |
| provisioning | GPU starting, model loading |
| serving | Endpoint live, accepting requests |
| unhealthy | Health check failed (auto-recovers) |
| stopped | Stopped by user |
| failed | Deployment failed; check error field |