Deployment - Veri Documentation

Overview

After training a model with Veri, you can deploy it as an OpenAI-compatible inference endpoint — or download the checkpoint and serve it yourself. Veri provides two deployment paths:

Hosted deployment: deploy directly from a completed training job. Veri provisions the GPU, serves the model, and gives you an OpenAI-compatible endpoint with per-request logging and metrics.
Self-hosted: download the checkpoint and serve it in your own stack (vLLM, SGLang, Transformers, etc.).

Hosted Deployment

Create a deployment

Deploy a completed training job as an inference endpoint. The same request is shown first with the Python SDK, then with curl:

from veri_sdk import Client

client = Client()

deployment = client.deployments.create(
    model="JOB_ID",           # training job ID
    source="training_job",
    name="my-math-model",
    gpu={"gpu_type": "H100", "gpu_count": 1},
)
print(f"Deployment ID: {deployment.id}")
print(f"Status: {deployment.status}")

curl -X POST https://api.veri.studio/v1/deployments \
  -H "Authorization: Bearer $VERI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "JOB_ID",
    "source": "training_job",
    "name": "my-math-model",
    "gpu": {
      "gpu_type": "H100",
      "gpu_count": 1
    }
  }'

You can also deploy Hugging Face models directly by setting source to huggingface and model to the repository ID:

curl -X POST https://api.veri.studio/v1/deployments \
  -H "Authorization: Bearer $VERI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-4B",
    "source": "huggingface",
    "name": "qwen3-base",
    "gpu": {"gpu_type": "A100-80GB", "gpu_count": 1}
  }'
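Both requests above share the same body shape, so it can help to build it in one place. The following is a minimal sketch, not part of the Veri SDK: the helper name build_deployment_payload and the KNOWN_SOURCES set are assumptions for illustration, and the set lists only the two source values shown in this guide (the API may accept others).

```python
from typing import Any

# Assumption: only the two `source` values shown in this guide are listed here.
KNOWN_SOURCES = {"training_job", "huggingface"}


def build_deployment_payload(
    model: str, source: str, name: str, gpu_type: str, gpu_count: int = 1
) -> dict[str, Any]:
    """Return a JSON-serializable body for POST /v1/deployments."""
    if source not in KNOWN_SOURCES:
        raise ValueError(
            f"unknown source {source!r}; expected one of {sorted(KNOWN_SOURCES)}"
        )
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return {
        "model": model,  # training job ID or Hugging Face repository ID
        "source": source,
        "name": name,
        "gpu": {"gpu_type": gpu_type, "gpu_count": gpu_count},
    }
```

Serializing the result with json.dumps gives the string passed to curl's -d flag in the examples above.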

Deployment lifecycle

Queued: Deployment created, waiting for GPU provisioning.
Provisioning: GPU instance is starting and the model is loading.
Serving: Endpoint is live and accepting inference requests.
Stopped: Deployment was stopped by the user. GPU released, billing stopped.

Send inference requests

Once the deployment is serving, send OpenAI-compatible chat completion requests. The same request is shown first with the OpenAI Python SDK, then with curl:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.veri.studio/v1/deployments/DEPLOYMENT_ID",
    api_key="vk_your_api_key",
)

response = client.chat.completions.create(
    model="my-math-model",
    messages=[
        {"role": "user", "content": "What is 15% of 240?"}
    ],
)
print(response.choices[0].message.content)

curl -X POST https://api.veri.studio/v1/deployments/DEPLOYMENT_ID/chat/completions \
  -H "Authorization: Bearer $VERI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-math-model",
    "messages": [
      {"role": "user", "content": "What is 15% of 240?"}
    ]
  }'
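If you would rather not depend on the OpenAI SDK, the curl request above can be reproduced with the Python standard library. This is a sketch under assumptions: the helper names build_chat_request and extract_reply are hypothetical, and extract_reply assumes the response follows the standard OpenAI chat completion shape used in the SDK example (choices[0].message.content).

```python
import json
import urllib.request

BASE_URL = "https://api.veri.studio/v1/deployments"


def build_chat_request(
    deployment_id: str, api_key: str, model: str, messages: list[dict]
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for a deployment."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/{deployment_id}/chat/completions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion body."""
    return response_body["choices"][0]["message"]["content"]
```

To send the request, pass it to urllib.request.urlopen, decode the response with json.load, and hand the result to extract_reply.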

Monitor your deployment

View request history and aggregate metrics:

# Request history
curl "https://api.veri.studio/v1/deployments/DEPLOYMENT_ID/requests?limit=20" \
  -H "Authorization: Bearer $VERI_API_KEY"

# Aggregate metrics
curl "https://api.veri.studio/v1/deployments/DEPLOYMENT_ID/metrics" \
  -H "Authorization: Bearer $VERI_API_KEY"

Metrics include total requests, average latency, token usage, error rate, uptime, and total cost.
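To make the relationship between request history and aggregate metrics concrete, here is a rough sketch that derives a few of the listed metrics from a list of request records. The record fields latency_ms, status_code, and total_tokens are hypothetical: this guide does not document the response schema of the requests endpoint, so adjust the keys to match what the API actually returns.

```python
from statistics import mean


def summarize_requests(records: list[dict]) -> dict:
    """Aggregate request records into a few summary metrics.

    Assumes hypothetical per-record keys: latency_ms, status_code, total_tokens.
    """
    total = len(records)
    if total == 0:
        return {
            "total_requests": 0,
            "avg_latency_ms": 0.0,
            "total_tokens": 0,
            "error_rate": 0.0,
        }
    # Treat any HTTP status of 400 or above as an error.
    errors = sum(1 for r in records if r["status_code"] >= 400)
    return {
        "total_requests": total,
        "avg_latency_ms": mean(r["latency_ms"] for r in records),
        "total_tokens": sum(r["total_tokens"] for r in records),
        "error_rate": errors / total,
    }
```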

Stop a deployment

curl -X POST https://api.veri.studio/v1/deployments/DEPLOYMENT_ID/stop \
  -H "Authorization: Bearer $VERI_API_KEY"

Stopping a deployment releases the GPU and stops billing. You can still view the deployment and its request history after it is stopped.

Billing

Deployments are billed per hour based on GPU type and count. Credits are deducted hourly while the deployment is running. To check your burn rate across all active jobs and deployments:

curl "https://api.veri.studio/v1/billing/burn_rate" \
  -H "Authorization: Bearer $VERI_API_KEY"
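As a back-of-the-envelope check on the billing model described above, the cost of a deployment can be estimated as hourly rate times GPU count times billed hours. The sketch below rests on two assumptions: the rates are supplied by the caller (this guide does not list prices), and a partial hour is billed as a full hour, which is a guess based on credits being deducted hourly.

```python
import math


def estimate_cost(
    gpu_type: str, gpu_count: int, hours: float, hourly_rates: dict[str, float]
) -> float:
    """Estimate credits consumed by a deployment.

    hourly_rates maps GPU type to a per-GPU hourly rate (caller-supplied).
    Assumption: partial hours are rounded up to whole hours.
    """
    if gpu_type not in hourly_rates:
        raise KeyError(f"no hourly rate known for GPU type {gpu_type!r}")
    billed_hours = math.ceil(hours)
    return hourly_rates[gpu_type] * gpu_count * billed_hours
```

For example, with a placeholder rate of 3.0 credits per H100-hour, two H100s running for 2.5 hours would be estimated at 3.0 × 2 × 3 = 18.0 credits.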

Self-Hosted Deployment

If you prefer to serve models on your own infrastructure, download the checkpoint from a completed training job.

Checkpoint destinations

Type	What happens
`veri`	Veri stores the checkpoint and returns a download URL.
`s3`	Veri writes the checkpoint to your S3 path.
`gs`	Veri writes the checkpoint to your Google Cloud Storage path.
`az`	Veri writes the checkpoint to your Azure Blob path.

Download the checkpoint

curl https://api.veri.studio/v1/training_jobs/JOB_ID/model \
  -H "Authorization: Bearer $VERI_API_KEY"
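For the veri destination type, the checkpoint is retrieved through a download URL. Once you have that URL, a streaming download avoids holding a large checkpoint in memory. The sketch below is an assumption-laden illustration: the function name download_checkpoint is hypothetical, the URL is passed in by the caller because this guide does not document the response field that carries it, and the URL is assumed to be fetchable without extra headers.

```python
import shutil
import urllib.request
from pathlib import Path


def download_checkpoint(
    download_url: str, dest: Path, chunk_size: int = 1024 * 1024
) -> int:
    """Stream a checkpoint from download_url to dest; return bytes written."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Copy in fixed-size chunks so the whole file is never held in memory.
    with urllib.request.urlopen(download_url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, length=chunk_size)
    return dest.stat().st_size
```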

Recommended serving stacks

Stack	Best for
vLLM	Production NVIDIA serving, multi-LoRA support
SGLang	High-throughput serving, multi-turn workloads
Transformers	Direct loading, custom pipelines

Example: serve with vLLM

pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model ./my-checkpoint \
  --host 0.0.0.0 \
  --port 8000

Your model is now served at http://localhost:8000/v1/chat/completions through vLLM's OpenAI-compatible API, so the chat completion requests shown earlier work against it with only the base URL changed.

Deployment statuses

Status	Description
`queued`	Deployment created, waiting for GPU
`provisioning`	GPU starting, model loading
`serving`	Endpoint live, accepting requests
`unhealthy`	Health check failed (auto-recovers)
`stopped`	Stopped by user
`failed`	Deployment failed; check `error` field
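The statuses above suggest a simple wait loop: keep polling while the deployment is queued, provisioning, or unhealthy, return once it is serving, and give up on failed or stopped. The following is a sketch rather than an SDK feature. The get_status callable is hypothetical (this guide does not show how to fetch a deployment's current status), and treating unhealthy as transient is an interpretation of the "auto-recovers" note in the table.

```python
import time
from typing import Callable

READY = "serving"
# Statuses that will not progress to serving on their own.
TERMINAL = {"failed", "stopped"}


def wait_until_serving(
    get_status: Callable[[], str],
    timeout_s: float = 600.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until the deployment is serving.

    Raises RuntimeError on a terminal status and TimeoutError on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == READY:
            return status
        if status in TERMINAL:
            raise RuntimeError(f"deployment ended in status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} seconds")
        # queued, provisioning, and unhealthy are treated as transient.
        sleep(poll_interval_s)
```

The sleep parameter is injectable so the loop can be exercised in tests without real delays.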
