Evaluate
Evals — quality gates before production
Evals in Quantlix are not a separate scoring tool. They are quality gates inside the AI Runtime Control Plane: tied to deployments, policies, and traces, and run before a change is allowed to reach production.
What an eval is
An eval is a saved set of test inputs run against a workflow or deployment, with the output scored on one or more dimensions. Suites group evals so you can compare versions side by side and treat regressions like failed tests.
Evals are part of the Evaluate pillar in the runtime control plane. They live next to your deployments and policies, not in a separate platform.
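To make the definition concrete, here is a minimal sketch of an eval and a suite as plain Python data structures. The names `Eval`, `Suite`, `run_suite`, and `pass_rate` are hypothetical and chosen for illustration; they are not the Quantlix API, and the "workflow" is just a function from input text to output text.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Eval:
    """A saved set of test inputs plus named scorers (hypothetical shape)."""
    name: str
    inputs: list[str]
    # Each scorer maps a workflow output to True (pass) or False (fail).
    scorers: dict[str, Callable[[str], bool]]


@dataclass
class Suite:
    """A named group of evals that is run as one unit."""
    name: str
    evals: list[Eval] = field(default_factory=list)


def run_suite(suite: Suite, workflow: Callable[[str], str]) -> dict[str, dict[str, bool]]:
    """Run every input through the workflow and score it on each dimension."""
    results: dict[str, dict[str, bool]] = {}
    for ev in suite.evals:
        for index, text in enumerate(ev.inputs):
            output = workflow(text)
            results[f"{ev.name}[{index}]"] = {
                dim: scorer(output) for dim, scorer in ev.scorers.items()
            }
    return results


def pass_rate(results: dict[str, dict[str, bool]]) -> float:
    """Fraction of examples that pass on every dimension."""
    if not results:
        return 0.0
    passed = sum(all(dims.values()) for dims in results.values())
    return passed / len(results)


# Toy example: the "schema" scorer only checks that the output is upper case.
suite = Suite("regression", [Eval("shouting", ["hello", "WORLD"], {"schema": str.isupper})])
v1 = run_suite(suite, str.upper)           # version 1 upper-cases its input
v2 = run_suite(suite, lambda text: text)   # version 2 returns input unchanged
```

Running the same suite against both versions gives two result sets keyed by the same example names, which is what makes a side-by-side comparison possible.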
Quality gates, not just scores
The point of an eval is to block bad changes from advancing. A score on its own is a report; a gate is a decision. In Quantlix, evals can:
- Fail a deployment if the pass rate drops below a threshold.
- Require human approval if specific examples regress.
- Promote a candidate workflow only when it beats the current production version.
- Surface a regression in observability so the trace and the failed eval are linked.
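The difference between a score and a gate can be sketched as a function that turns eval results into a decision. This is an illustrative sketch only: the `Decision` values, the `gate` function, the `watched` set of examples, and the 0.9 default threshold are assumptions, not Quantlix configuration.

```python
from __future__ import annotations

from enum import Enum


class Decision(Enum):
    FAIL = "fail"                      # block the deployment
    NEEDS_APPROVAL = "needs_approval"  # a human has to sign off
    PROMOTE = "promote"                # candidate replaces production
    HOLD = "hold"                      # passes, but is not better than production


def gate(
    candidate_rate: float,
    production_rate: float,
    regressed: set[str],
    watched: set[str],
    min_pass_rate: float = 0.9,  # assumed threshold, for illustration
) -> Decision:
    """Map eval results to a rollout decision, checking the strictest rule first."""
    if candidate_rate < min_pass_rate:
        return Decision.FAIL
    if regressed & watched:
        # At least one example that the team explicitly watches got worse.
        return Decision.NEEDS_APPROVAL
    if candidate_rate > production_rate:
        return Decision.PROMOTE
    return Decision.HOLD


decision = gate(
    candidate_rate=0.95,
    production_rate=0.92,
    regressed={"refund-policy"},
    watched={"refund-policy", "pii-redaction"},
)
```

In this example the candidate has a higher pass rate than production, yet the gate returns `Decision.NEEDS_APPROVAL` because a watched example regressed. Ordering the checks from strictest to most permissive keeps an overall improvement from overriding a specific regression.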
When to use evals
Before a workflow change ships
Run the regression suite against the new version. If the pass rate drops or specific examples fail, block the rollout.
After a model upgrade
Compare the old and new model on the same suite. Surface where the new model regresses, not just where it improves.
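A per-example diff shows why the aggregate pass rate alone is not enough. The sketch below assumes each run is summarized as a mapping from example name to pass or fail; `diff_runs` and the example names are hypothetical.

```python
from __future__ import annotations


def diff_runs(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> tuple[list[str], list[str]]:
    """Return (regressed, improved) example names between two runs of one suite.

    An example missing from the candidate run is treated as a failure.
    """
    regressed = sorted(
        name for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )
    improved = sorted(
        name for name, passed in baseline.items()
        if not passed and candidate.get(name, False)
    )
    return regressed, improved


old_model = {"summarize-invoice": True, "cite-contract": True, "refuse-pii": False}
new_model = {"summarize-invoice": True, "cite-contract": False, "refuse-pii": True}

regressed, improved = diff_runs(old_model, new_model)
```

Both models pass two of three examples, so the pass rate is unchanged, but `regressed` contains `cite-contract` and `improved` contains `refuse-pii`. Reporting the two lists separately surfaces the regression that the headline number hides.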
When a customer reports a bad answer
Add the failing case to the suite. Future changes have to keep passing it before they reach production.
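One way to picture this: a reported failure becomes a pinned case that every future version has to pass. The `Case` shape, the `pinned` flag, and the substring check are simplifying assumptions for illustration, not how Quantlix stores suites.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    case_id: str
    prompt: str
    must_contain: str     # simplistic check: the answer has to include this text
    pinned: bool = False  # pinned cases come from real reports and must always pass


def add_reported_case(suite: list[Case], case_id: str, prompt: str, must_contain: str) -> list[Case]:
    """Return a new suite with the reported failure added once, as a pinned case."""
    if any(case.case_id == case_id for case in suite):
        return suite
    return suite + [Case(case_id, prompt, must_contain, pinned=True)]


def pinned_failures(suite: list[Case], workflow: Callable[[str], str]) -> list[str]:
    """IDs of pinned cases that the given workflow version still gets wrong."""
    return [
        case.case_id for case in suite
        if case.pinned and case.must_contain not in workflow(case.prompt)
    ]


def old_workflow(prompt: str) -> str:
    return "You have 14 days to return an item."


def fixed_workflow(prompt: str) -> str:
    return "You have 30 days to return an item."


suite = [Case("greeting", "Say hi", "hi")]
suite = add_reported_case(suite, "refund-window", "How long do I have to return an item?", "30 days")

still_failing = pinned_failures(suite, old_workflow)
after_fix = pinned_failures(suite, fixed_workflow)
```

Here `still_failing` lists `refund-window` for the old workflow and `after_fix` is empty once the answer is corrected. Any later change that reintroduces the bad answer would show up in the same list.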
On a schedule
Run drift detection against the live deployment so quality regressions surface within hours, not weeks.
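A scheduled drift check can be as simple as comparing a rolling average of recent pass rates against a recorded baseline. The sketch below is an assumption about how such a check might work; `DriftMonitor`, the window size, and the tolerance are hypothetical, and the scheduling itself (cron, a job runner) is left out.

```python
from __future__ import annotations

from collections import deque


class DriftMonitor:
    """Flags drift when the rolling average pass rate falls below baseline minus tolerance."""

    def __init__(self, baseline_pass_rate: float, tolerance: float = 0.05, window: int = 3):
        self.baseline = baseline_pass_rate
        self.tolerance = tolerance
        # deque with maxlen drops the oldest value automatically.
        self.recent: deque[float] = deque(maxlen=window)

    def record(self, pass_rate: float) -> bool:
        """Record one scheduled run; return True if the deployment has drifted."""
        self.recent.append(pass_rate)
        average = sum(self.recent) / len(self.recent)
        return average < self.baseline - self.tolerance


monitor = DriftMonitor(baseline_pass_rate=0.95, tolerance=0.05, window=3)

# Pass rates from three consecutive scheduled runs against the live deployment.
alerts = [monitor.record(rate) for rate in (0.94, 0.90, 0.80)]
```

Averaging over a window avoids alerting on a single noisy run, at the cost of reacting one or two runs later. With the values above, only the third run pushes the rolling average below the alert line.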
Scoring dimensions
Quality is not one number. Quantlix surfaces several axes per run so a regression in one dimension is not hidden by an improvement in another.
| Dimension | Question it answers |
|---|---|
| Groundedness | Did the answer cite the supporting source? |
| Citation precision | Are the cited sources actually relevant? |
| Safety | Did the response trip a safety check or policy? |
| Schema | Did the output match the expected shape? |
| Latency | Did the run finish within the latency budget? |
| Cost | Did the run stay under the per-request limit? |
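The table's dimensions can be compared one at a time rather than averaged. The sketch below assumes each version is summarized as a pass rate per dimension; the dimension keys, the numbers, and `regressed_dimensions` are illustrative and not taken from Quantlix.

```python
from __future__ import annotations

from statistics import mean

DIMENSIONS = ["groundedness", "citation_precision", "safety", "schema", "latency", "cost"]


def regressed_dimensions(
    old: dict[str, float], new: dict[str, float], tolerance: float = 0.0
) -> list[str]:
    """Dimensions where the new pass rate is lower than the old one by more than tolerance."""
    return [dim for dim in DIMENSIONS if new[dim] < old[dim] - tolerance]


# Made-up pass rates per dimension for two versions of the same workflow.
old_rates = {
    "groundedness": 0.90, "citation_precision": 0.80, "safety": 1.00,
    "schema": 0.95, "latency": 0.90, "cost": 0.90,
}
new_rates = {
    "groundedness": 0.97, "citation_precision": 0.92, "safety": 0.96,
    "schema": 0.95, "latency": 0.93, "cost": 0.90,
}

overall_improved = mean(new_rates.values()) > mean(old_rates.values())
regressions = regressed_dimensions(old_rates, new_rates)
```

In this made-up data the average across dimensions goes up, so `overall_improved` is true, while `regressions` still reports `safety`. A single blended score would have hidden that drop.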
Evals measure quality. Policies enforce what AI is allowed to do. The two are separate but related — see runtime policies for the enforcement side.
Where to start
- Open the evals workspace to create your first suite.
- Read the observability guide — eval results show up alongside traces.
- See the AI Runtime Control Plane explainer for how evals fit with deployments, policies, and providers.