Evaluate

Evals — quality gates before production

Evals in Quantlix are not a separate scoring tool. They are quality gates inside the AI Runtime Control Plane: tied to deployments, policies, and traces, and run before a change is allowed to reach production.

What an eval is

An eval is a saved set of test inputs run against a workflow or deployment, with one or more scoring dimensions on the output. Suites group evals so you can compare versions side by side and treat regressions like failed tests.
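
As a rough illustration of that definition, an eval and a suite can be modeled as plain data structures. This is a minimal sketch, not the Quantlix SDK: the names Eval, Suite, Scorer, run, and compare are hypothetical, and a workflow is assumed to be any function from an input string to an output string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for illustration only; not the Quantlix API.
Workflow = Callable[[str], str]
Scorer = Callable[[str, str], bool]  # (input, output) -> did this dimension pass?


@dataclass
class Eval:
    name: str
    inputs: List[str]           # the saved test inputs
    scorers: Dict[str, Scorer]  # scoring dimension -> scorer

    def run(self, workflow: Workflow) -> Dict[str, float]:
        """Return the pass rate per scoring dimension for one workflow version."""
        if not self.inputs:
            raise ValueError("an eval needs at least one test input")
        pairs = [(x, workflow(x)) for x in self.inputs]
        return {
            dim: sum(scorer(x, y) for x, y in pairs) / len(pairs)
            for dim, scorer in self.scorers.items()
        }


@dataclass
class Suite:
    name: str
    evals: List[Eval] = field(default_factory=list)

    def compare(self, versions: Dict[str, Workflow]) -> Dict[str, Dict[str, Dict[str, float]]]:
        """Run every eval against every version so results can be read side by side."""
        return {
            e.name: {label: e.run(wf) for label, wf in versions.items()}
            for e in self.evals
        }


# Usage: one eval with a single "schema" dimension, two toy workflow versions.
non_empty = Eval("non_empty", ["hi", "hello"], {"schema": lambda x, y: bool(y)})
suite = Suite("smoke", [non_empty])
result = suite.compare({"v1": str.upper, "v2": lambda s: ""})
```

In this toy run, v2 returns empty strings, so its "schema" pass rate falls to 0.0 next to v1's 1.0, which is the kind of side-by-side regression a suite is meant to expose.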

Evals are part of the Evaluate pillar in the runtime control plane. They live next to your deployments and policies, not in a separate platform.

Quality gates, not just scores

The point of an eval is to block bad changes from advancing. A score on its own is a report; a gate is a decision. In Quantlix, evals can:

  • Fail a deployment if pass rate drops below a threshold.
  • Require human approval if specific examples regress.
  • Promote a candidate workflow only when it beats the current production version.
  • Surface a regression in observability so the trace and the failed eval are linked.
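
The first three behaviors can be pictured as one decision function over per-example results. The sketch below is an assumption about how such a gate might be wired, not Quantlix's implementation: gate, Decision, min_pass_rate, and pinned are hypothetical names, and the HOLD outcome (the candidate passes but does not beat production) is added here only to make the function total. Linking regressions to traces is not modeled.

```python
from enum import Enum
from typing import Dict, Set


class Decision(Enum):
    FAIL = "fail"                      # pass rate below the threshold
    NEEDS_APPROVAL = "needs_approval"  # a pinned example regressed
    HOLD = "hold"                      # passes, but does not beat production (assumed outcome)
    PROMOTE = "promote"                # passes and beats production


def gate(
    candidate: Dict[str, bool],   # example id -> passed, for the candidate version
    production: Dict[str, bool],  # example id -> passed, for the production version
    min_pass_rate: float,
    pinned: Set[str],             # examples that must not regress without sign-off
) -> Decision:
    if not candidate or not production:
        raise ValueError("both runs need at least one example")

    pass_rate = sum(candidate.values()) / len(candidate)
    if pass_rate < min_pass_rate:
        return Decision.FAIL

    # A regression is an example that passed in production and fails in the candidate.
    regressed = {
        ex for ex in pinned
        if production.get(ex, False) and not candidate.get(ex, False)
    }
    if regressed:
        return Decision.NEEDS_APPROVAL

    production_rate = sum(production.values()) / len(production)
    return Decision.PROMOTE if pass_rate > production_rate else Decision.HOLD


# Usage: the candidate passes 3 of 4 examples, production passes 2 of 4.
candidate = {"a": True, "b": True, "c": False, "d": True}
production = {"a": True, "b": False, "c": True, "d": False}

strict = gate(candidate, production, min_pass_rate=0.8, pinned=set())
pinned_c = gate(candidate, production, min_pass_rate=0.7, pinned={"c"})
unpinned = gate(candidate, production, min_pass_rate=0.7, pinned=set())
```

The same candidate run produces three different decisions depending on the threshold and on which examples are pinned, which is the difference between reporting a score and acting on it.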

When to use evals

Before a workflow change ships

Run the regression suite against the new version. If pass rate drops or specific examples fail, block the rollout.

After a model upgrade

Compare the old and new model on the same suite. Surface where the new model regresses, not just where it improves.
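
A per-example diff makes that comparison concrete. This is a minimal sketch under the assumption that each run can be reduced to a mapping from example id to pass or fail; diff_runs is a hypothetical helper, not a Quantlix function.

```python
from typing import Dict, List, Tuple


def diff_runs(old: Dict[str, bool], new: Dict[str, bool]) -> Tuple[List[str], List[str]]:
    """Compare two runs of the same suite, example by example.

    Returns (regressions, improvements). Only examples present in both runs
    are compared.
    """
    shared = old.keys() & new.keys()
    regressions = sorted(ex for ex in shared if old[ex] and not new[ex])
    improvements = sorted(ex for ex in shared if not old[ex] and new[ex])
    return regressions, improvements


# Usage: both models pass 2 of 3 examples, but not the same two.
old_model = {"q1": True, "q2": True, "q3": False}
new_model = {"q1": True, "q2": False, "q3": True}
regressions, improvements = diff_runs(old_model, new_model)
```

Here the aggregate pass rate is identical for both models, yet q2 has regressed. Comparing only the headline number would miss it.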

When a customer reports a bad answer

Add the failing case to the suite. Future changes have to keep passing it before they reach production.

On a schedule

Run drift detection against the live deployment so quality regressions surface within hours, not weeks.
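
One simple way to think about scheduled drift detection is a rolling window of recent results compared against a baseline pass rate. The sketch below illustrates that idea only; it does not describe how Quantlix detects drift. DriftMonitor, window, and max_drop are hypothetical names and parameters.

```python
from collections import deque


class DriftMonitor:
    """Flag drift when the rolling pass rate falls too far below a baseline."""

    def __init__(self, baseline_pass_rate: float, window: int = 50, max_drop: float = 0.05):
        self.baseline = baseline_pass_rate
        self.max_drop = max_drop
        self.results = deque(maxlen=window)  # keeps only the most recent results

    def record(self, passed: bool) -> bool:
        """Record one scheduled eval result; return True when drift is detected."""
        self.results.append(passed)
        if len(self.results) < self.results.maxlen:
            return False  # not enough data yet to judge
        rate = sum(self.results) / len(self.results)
        return self.baseline - rate > self.max_drop


# Usage: a tiny window of 4 results against a 90% baseline.
monitor = DriftMonitor(baseline_pass_rate=0.9, window=4, max_drop=0.1)
signals = [monitor.record(p) for p in (True, True, True, False)]
```

With these toy settings, the fourth result completes the window, the rolling pass rate is 0.75, and the drop from the baseline exceeds max_drop, so the last call reports drift. How quickly a real regression surfaces depends on the schedule and the window size.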

Scoring dimensions

Quality is not one number. Quantlix surfaces several axes per run so a regression in one dimension is not hidden by an improvement in another.

Dimension           Question it answers
Groundedness        Did the answer cite the supporting source?
Citation precision  Are the cited sources actually relevant?
Safety              Did the response trip a safety check or policy?
Schema              Did the output match the expected shape?
Latency             Did the run finish under the budget?
Cost                Did the run stay under the per-request limit?
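
To see why the dimensions are reported separately, consider checking each one against a baseline instead of averaging them. This is a minimal sketch with made-up numbers: regressed_dimensions, tolerance, and the per-dimension pass rates are all hypothetical, and each dimension is assumed to be expressed as a pass rate between 0 and 1.

```python
from typing import Dict, List

# Dimension names taken from the table above; the snake_case keys are an assumption.
DIMENSIONS = ("groundedness", "citation_precision", "safety", "schema", "latency", "cost")


def regressed_dimensions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return every dimension where the candidate is worse than the baseline."""
    return [d for d in DIMENSIONS if candidate[d] < baseline[d] - tolerance]


# Made-up pass rates per dimension, for illustration only.
baseline = {
    "groundedness": 0.90, "citation_precision": 0.80, "safety": 1.00,
    "schema": 1.00, "latency": 0.95, "cost": 0.95,
}
candidate = {
    "groundedness": 0.95, "citation_precision": 0.90, "safety": 0.90,
    "schema": 1.00, "latency": 0.95, "cost": 0.95,
}

regressed = regressed_dimensions(baseline, candidate)
average_improved = sum(candidate.values()) > sum(baseline.values())
```

In this example the candidate's average across dimensions is higher than the baseline's, but safety has dropped. A single blended score would report an improvement, while the per-dimension check flags the safety regression.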

Evals measure quality. Policies enforce what AI is allowed to do. The two are separate but related; see runtime policies for the enforcement side.

Where to start