Cost and usage control

Budget control

Quantlix budget controls cap cost and usage at the platform boundary, before unexpected traffic, retries, slow tools, or agent loops surprise finance.

Request rate

Caps how many requests a user or deployment can run per minute, so a usage spike is held at the ceiling.

Compute ceiling

Limits how long a request can consume runtime resources before it is blocked or stopped.

Retry amplification

Prevents retries from multiplying cost after provider failures, tool failures, or timeouts.

Route-level design

Uses cheaper deployments, routers, approvals, and agent iteration limits before expensive branches run.

Setup path

  1. Identify the expensive path: chat model, RAG query, native agent, tool call, or workflow branch.
  2. Choose the enforcement boundary: deployment config for direct model calls, or workflow policy nodes before expensive steps.
  3. Set a request-rate ceiling for expected traffic plus a small buffer.
  4. Set compute and retry ceilings based on your slowest acceptable production request.
  5. Use the cost-sensitive enforcement pack as a starting point when you want standard defaults.
  6. Test valid traffic and overload traffic, then inspect observability for budget outcomes and cost estimates.
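
Steps 3 and 4 come down to simple arithmetic, sketched below in Python. The helper names and the 20 percent buffer are illustrative assumptions, not Quantlix defaults. Only the two budget field names come from the deployment example in this page.

```python
def request_rate_ceiling(expected_peak_per_minute: int, buffer_percent: int = 20) -> int:
    """Expected peak traffic plus a small buffer, rounded up to a whole request.

    Integer arithmetic avoids floating-point surprises when rounding up.
    """
    if expected_peak_per_minute < 0 or buffer_percent < 0:
        raise ValueError("inputs must be non-negative")
    scaled = expected_peak_per_minute * (100 + buffer_percent)
    return -(-scaled // 100)  # ceiling division


def compute_ceiling_seconds(slowest_acceptable_ms: int) -> int:
    """Round the slowest acceptable production request up to whole seconds."""
    if slowest_acceptable_ms < 0:
        raise ValueError("duration must be non-negative")
    return -(-slowest_acceptable_ms // 1000)


# Hypothetical inputs: 50 requests/minute at peak, slowest acceptable request 95.5 s.
budget = {
    "request_rate_per_minute": request_rate_ceiling(50, buffer_percent=20),
    "max_compute_per_request_seconds": compute_ceiling_seconds(95_500),
}
```

Deriving the numbers from measured traffic, rather than picking round values, makes it easier to explain later why a given request was blocked.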

Examples

Basic deployment budget

Cap direct provider-backed model calls before unexpected traffic reaches the model.

{
  "pipeline_lock": {
    "contract_version": "1.0",
    "mode": "enforce",
    "schema": {
      "strict": true,
      "input_schema": {
        "type": "object",
        "required": ["prompt"],
        "properties": { "prompt": { "type": "string" } },
        "additionalProperties": false
      }
    },
    "policies": {
      "actions": { "on_violation": "block", "emit_event": true },
      "budget": {
        "request_rate_per_minute": 60,
        "max_compute_per_request_seconds": 120,
        "retry_cost_multiplier_ceiling": 2.0
      }
    }
  }
}
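
To reason about what the three budget fields might do at runtime, the following Python sketch models them with a toy in-memory gate. It is not the Quantlix implementation. The class, its method names, the sliding-window behavior, and the reading of `retry_cost_multiplier_ceiling` as "total attempts may cost at most N times one attempt" are all assumptions made for illustration.

```python
from collections import deque


class BudgetGate:
    """Toy model of the budget block above, under assumed semantics."""

    def __init__(self, request_rate_per_minute: int,
                 max_compute_per_request_seconds: float,
                 retry_cost_multiplier_ceiling: float) -> None:
        self.rate = request_rate_per_minute
        self.max_compute = max_compute_per_request_seconds
        self.retry_ceiling = retry_cost_multiplier_ceiling
        self._starts = deque()  # start times of requests admitted in the last 60 s

    def admit(self, now: float) -> bool:
        """Sliding-window check: admit only if the last minute is under the rate."""
        while self._starts and now - self._starts[0] >= 60.0:
            self._starts.popleft()
        if len(self._starts) >= self.rate:
            return False
        self._starts.append(now)
        return True

    def within_compute(self, elapsed_seconds: float) -> bool:
        """True while the request has not exceeded its compute ceiling."""
        return elapsed_seconds <= self.max_compute

    def may_retry(self, attempts_so_far: int) -> bool:
        """Assume every attempt costs about the same, so the cost multiplier
        equals the attempt count; allow a retry only if it stays in budget."""
        return attempts_so_far + 1 <= self.retry_ceiling


gate = BudgetGate(60, 120, 2.0)
decisions = [gate.admit(now=i * 0.1) for i in range(61)]  # 61 requests in ~6 s
```

With the values from the example, the first 60 requests in the window are admitted and the 61st is refused, and a failed request gets at most one retry under the assumed multiplier semantics.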

Cost-sensitive pack

Use the preset when you want rate, compute, and retry ceilings without hand-writing the full policy.

curl -X POST https://api.quantlix.ai/deployments/DEPLOYMENT_ID/apply-pack \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"pack_id": "cost-sensitive"}'

Agent budget pattern

Keep native tool-calling agents bounded so a tool loop cannot surprise finance.

{
  "agent": {
    "deployment_id": "dep_reasoning_model",
    "prompt_field": "question",
    "max_iterations": 4,
    "tools": [
      {
        "name": "lookup_account",
        "description": "Fetch account context",
        "input_schema": {
          "type": "object",
          "properties": { "account_id": { "type": "string" } },
          "required": ["account_id"]
        },
        "function": {
          "execution_type": "http",
          "method": "POST",
          "endpoint": "https://crm.internal/account",
          "timeout_ms": 5000
        }
      }
    ]
  }
}
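
The effect of `max_iterations` can be pictured as a bounded loop. The Python sketch below is a minimal illustration, not the Quantlix agent runtime. The `run_agent` function, the `step` callback, and the status strings are hypothetical names.

```python
from typing import Any, Callable, Dict, Tuple


def run_agent(step: Callable[[int], Tuple[bool, Any]], max_iterations: int) -> Dict[str, Any]:
    """Call `step` until it reports completion or the iteration budget is spent.

    `step(i)` stands in for one model turn plus any tool call and returns
    (done, result).
    """
    for i in range(max_iterations):
        done, result = step(i)
        if done:
            return {"status": "completed", "iterations": i + 1, "result": result}
    # Budget exhausted: stop instead of paying for another turn.
    return {"status": "stopped_at_max_iterations", "iterations": max_iterations, "result": None}


def looping_step(i: int) -> Tuple[bool, Any]:
    """A misbehaving agent that keeps calling tools and never finishes."""
    return (False, None)


def quick_step(i: int) -> Tuple[bool, Any]:
    """An agent that finishes on its second turn."""
    return (i == 1, "account summary")


runaway = run_agent(looping_step, max_iterations=4)
normal = run_agent(quick_step, max_iterations=4)
```

The runaway agent costs at most four turns, each of which is further bounded by the tool's `timeout_ms`.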

Workflow cost pattern

Route cheap classification before expensive retrieval, tools, or stronger models.

input
  -> policy_check
  -> router / condition
  -> cheap classifier model
  -> retrieval or tool_call only when needed
  -> final model
  -> output
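
The routing idea in the flow above can be sketched in Python as follows. This is an illustration only. The keyword rule stands in for the cheap classifier model, and `classify`, `handle`, and the fake retrieval and model functions are hypothetical. The policy check and output steps are omitted for brevity.

```python
from typing import Callable, List, Optional


def classify(question: str) -> str:
    """Stand-in for the cheap classifier model (keyword rule, illustrative only)."""
    keywords = ("account", "invoice", "order")
    needs_lookup = any(k in question.lower() for k in keywords)
    return "needs_retrieval" if needs_lookup else "simple"


def handle(question: str,
           retrieve: Callable[[str], str],
           final_model: Callable[[str, Optional[str]], str]) -> str:
    """Run the expensive retrieval step only when the classifier asks for it."""
    context = retrieve(question) if classify(question) == "needs_retrieval" else None
    return final_model(question, context)


retrieval_calls: List[str] = []


def fake_retrieve(question: str) -> str:
    retrieval_calls.append(question)  # record each expensive call
    return "account context"


def fake_final_model(question: str, context: Optional[str]) -> str:
    return "answer(with context)" if context else "answer(no context)"


simple = handle("What are your opening hours?", fake_retrieve, fake_final_model)
lookup = handle("Why did my invoice change?", fake_retrieve, fake_final_model)
```

Only the second question triggers retrieval, so the expensive branch is paid for only when the cheap step says it is needed.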

Finance and platform questions

Is this billing?

No. Budget gates are runtime controls. They reduce runaway usage, but invoices are still determined by your Quantlix plan terms and by the providers you connect.

Can it stop provider spend?

It can stop requests before provider inference when configured at the Quantlix boundary. Provider-side usage is still governed by your provider account.

Can I prove what was blocked?

Yes. Observability and enforcement events show budget decisions, run status, trace IDs, and cost estimates where available.

Does this apply to agents?

Yes. Combine deployment budget policies with agent max_iterations, function timeouts, retry limits, and approval gates.

What to verify

  • Over-limit traffic is blocked or downgraded according to policy before provider inference.
  • Retry amplification is capped during provider, retrieval, or tool failures.
  • Agent loops stop at max_iterations, and function timeouts are set for external calls.
  • Observability shows budget outcomes, run status, latency, and cost estimates.
  • Finance understands that budget gates are controls, not a replacement for provider billing limits.