Evals — quality gates before production

Evals in Quantlix are not a separate scoring tool. They are quality gates inside the AI Runtime Control Plane: tied to deployments, policies, and traces, and run before a change is allowed to reach production.

What an eval is

An eval is a saved set of test inputs that is run against a workflow or deployment, with each output scored on one or more dimensions. Suites group evals so you can compare versions side by side and treat regressions like failed tests.
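To make the definition concrete, here is a minimal sketch of how an eval and a suite could be modeled. Every name in it (`EvalCase`, `Eval`, `Suite`, `Scorer`, the two toy workflows) is a hypothetical stand-in chosen for illustration, not the Quantlix API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names below are hypothetical, for illustration only; not the Quantlix API.
Workflow = Callable[[str], str]        # input text -> output text
Scorer = Callable[[str, str], bool]    # (input, output) -> passed?


@dataclass
class EvalCase:
    case_id: str
    input_text: str


@dataclass
class Eval:
    name: str
    cases: List[EvalCase]
    scorers: Dict[str, Scorer]  # scoring dimension name -> scorer

    def run(self, workflow: Workflow) -> Dict[str, Dict[str, bool]]:
        """Return {case_id: {dimension: passed}} for one workflow version."""
        results: Dict[str, Dict[str, bool]] = {}
        for case in self.cases:
            output = workflow(case.input_text)
            results[case.case_id] = {
                dim: scorer(case.input_text, output)
                for dim, scorer in self.scorers.items()
            }
        return results


@dataclass
class Suite:
    name: str
    evals: List[Eval] = field(default_factory=list)

    def run(self, workflow: Workflow) -> Dict[str, Dict[str, Dict[str, bool]]]:
        """Run every eval in the suite against the same workflow version."""
        return {ev.name: ev.run(workflow) for ev in self.evals}


# Two toy workflow versions: v2 has a bug and returns an empty answer.
def workflow_v1(text: str) -> str:
    return text.strip().capitalize()


def workflow_v2(text: str) -> str:
    return ""


suite = Suite(
    name="regression",
    evals=[
        Eval(
            name="non_empty_answer",
            cases=[EvalCase("greeting", " hello ")],
            scorers={"schema": lambda _inp, out: len(out) > 0},
        )
    ],
)

v1_results = suite.run(workflow_v1)
v2_results = suite.run(workflow_v2)
```

Running the same suite against both versions yields results with the same shape, which is what makes a side-by-side comparison possible.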

Evals are part of the Evaluate pillar in the runtime control plane. They live next to your deployments and policies, not in a separate platform.

Quality gates, not just scores

The point of an eval is to block bad changes from advancing. A score on its own is a report; a gate is a decision. In Quantlix, evals can:

Fail a deployment if the pass rate drops below a threshold.
Require human approval if specific examples regress.
Promote a candidate workflow only when it beats the current production version.
Surface a regression in observability so the trace and the failed eval are linked.
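The first three gate behaviors above can be sketched as a single decision function. This is an illustration under stated assumptions, not how Quantlix implements gates: the `GateDecision` values, the `decide` function, its parameters, and the order in which the checks run are all hypothetical.

```python
from enum import Enum
from typing import Dict, FrozenSet


class GateDecision(Enum):
    # Hypothetical outcomes, named for this sketch only.
    BLOCK = "block"                    # pass rate below the threshold
    NEEDS_APPROVAL = "needs_approval"  # a protected example regressed
    HOLD = "hold"                      # acceptable, but does not beat production
    PROMOTE = "promote"                # beats the current production version


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of cases that passed; an empty run counts as 0.0."""
    return sum(results.values()) / len(results) if results else 0.0


def decide(
    candidate: Dict[str, bool],
    production: Dict[str, bool],
    min_pass_rate: float = 0.9,
    protected_cases: FrozenSet[str] = frozenset(),
) -> GateDecision:
    candidate_rate = pass_rate(candidate)
    if candidate_rate < min_pass_rate:
        return GateDecision.BLOCK

    # Cases that pass in production but fail (or are missing) in the candidate.
    regressed = {
        case_id
        for case_id, passed in production.items()
        if passed and not candidate.get(case_id, False)
    }
    if regressed & protected_cases:
        return GateDecision.NEEDS_APPROVAL

    if candidate_rate > pass_rate(production):
        return GateDecision.PROMOTE
    return GateDecision.HOLD


production = {"a": True, "b": True, "c": False, "d": True}
better = {"a": True, "b": True, "c": True, "d": True}
regresses_a = {"a": False, "b": True, "c": True, "d": True}
mostly_failing = {"a": True, "b": False, "c": False, "d": False}

promote = decide(better, production, min_pass_rate=0.7)
approval = decide(regresses_a, production, min_pass_rate=0.7,
                  protected_cases=frozenset({"a"}))
block = decide(mostly_failing, production, min_pass_rate=0.7)
hold = decide(dict(production), production, min_pass_rate=0.7)
```

The function returns a decision rather than a number, which mirrors the distinction between a report and a gate.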

When to use evals

Before a workflow change ships

Run the regression suite against the new version. If the pass rate drops or specific examples fail, block the rollout.

After a model upgrade

Compare the old and new model on the same suite. Surface where the new model regresses, not just where it improves.
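A per-example comparison is one way to see regressions that an overall pass rate would hide. The sketch below assumes each run is available as a mapping from case ID to pass/fail; the `compare_runs` function and the case names are hypothetical.

```python
from typing import Dict, List


def compare_runs(old: Dict[str, bool], new: Dict[str, bool]) -> Dict[str, List[str]]:
    """List the cases that flipped between two runs of the same suite."""
    shared = sorted(old.keys() & new.keys())  # only compare cases present in both
    return {
        "regressed": [c for c in shared if old[c] and not new[c]],
        "improved": [c for c in shared if not old[c] and new[c]],
    }


# Hypothetical results: the new model passes more cases overall (3 of 4 vs 2 of 4)
# but breaks one case that used to pass.
old_model = {"refund-policy": True, "tax-question": False,
             "greeting": True, "long-context": False}
new_model = {"refund-policy": False, "tax-question": True,
             "greeting": True, "long-context": True}

report = compare_runs(old_model, new_model)
```

Here the aggregate pass rate goes up, yet `report["regressed"]` still names the case that broke.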

When a customer reports a bad answer

Add the failing case to the suite. Future changes have to keep passing it before they reach production.

On a schedule

Run drift detection against the live deployment so quality regressions surface within hours, not weeks.
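As a rough illustration of what a scheduled drift check might compute, the sketch below compares the pass rate of recent live results against a stored baseline. The `has_drifted` function, its `tolerance` and `min_samples` parameters, and the idea of calling it from an hourly job are assumptions for this example, not documented Quantlix behavior.

```python
from typing import Sequence


def has_drifted(
    baseline_pass_rate: float,
    recent_results: Sequence[bool],
    tolerance: float = 0.05,
    min_samples: int = 20,
) -> bool:
    """Flag drift when the recent pass rate falls more than `tolerance` below baseline."""
    if len(recent_results) < min_samples:
        return False  # too few samples to draw a conclusion
    recent_rate = sum(recent_results) / len(recent_results)
    return baseline_pass_rate - recent_rate > tolerance


# Hypothetical windows of 20 recent eval results each.
healthy_window = [True] * 19 + [False]      # 95% pass rate
degraded_window = [True] * 16 + [False] * 4  # 80% pass rate

healthy = has_drifted(0.95, healthy_window)
degraded = has_drifted(0.95, degraded_window)
too_few = has_drifted(0.95, [False] * 5)
```

Requiring a minimum sample size keeps a handful of unlucky requests from raising an alert on its own.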

Scoring dimensions

Quality is not one number. Quantlix surfaces several axes per run so a regression in one dimension is not hidden by an improvement in another.

Dimension	Question it answers
Groundedness	Did the answer cite the supporting source?
Citation precision	Are the cited sources actually relevant?
Safety	Did the response trip a safety check or policy?
Schema	Did the output match the expected shape?
Latency	Did the run finish under the budget?
Cost	Did the run stay under the per-request limit?
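The reason for reporting dimensions separately can be shown with a small worked example. It assumes each score-like dimension is summarized as a value between 0 and 1 where higher is better; latency and cost are budgets rather than scores and are left out. The `regressed_dimensions` function, the `tolerance` parameter, and the numbers are hypothetical.

```python
from typing import Dict, List


def regressed_dimensions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Return dimensions where the candidate scores worse than baseline by more than `tolerance`."""
    return sorted(
        dim
        for dim, base_score in baseline.items()
        if candidate.get(dim, 0.0) < base_score - tolerance
    )


def mean_score(scores: Dict[str, float]) -> float:
    return sum(scores.values()) / len(scores)


# Hypothetical per-dimension scores for two versions.
baseline = {"groundedness": 0.90, "safety": 0.99, "schema": 0.95}
candidate = {"groundedness": 0.98, "safety": 0.93, "schema": 0.95}

average_improved = mean_score(candidate) > mean_score(baseline)
regressed = regressed_dimensions(baseline, candidate)
```

The candidate's average is higher than the baseline's, but the safety score dropped; checking each dimension on its own is what exposes that.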

Evals measure quality. Policies enforce what AI is allowed to do. The two are separate but related — see runtime policies for the enforcement side.

Where to start

Open the evals workspace to create your first suite.
Read the observability guide — eval results show up alongside traces.
See the AI Runtime Control Plane explainer for how evals fit with deployments, policies, and providers.