What gets measured
Each scenario run produces two independent scores:

| Score | Question |
|---|---|
| Pass rate | Did the agent actually fix the problem? |
| Compliance | Did the agent follow your process? |
Pass rate
Verifies the end result, regardless of how the agent got there.

- Ansible-based ground-truth check: after the agent finishes, an Ansible playbook inspects the host (service status, file contents, port responsiveness). This is the most reliable check.
- Output-based check: if there’s nothing to verify on the host (e.g. for pure investigation tickets), the quality of the agent’s report is scored instead.
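
To illustrate the kind of host inspection a ground-truth check performs, here is a minimal Python sketch of two of the checks named above (file contents and port responsiveness). It is a stand-in for illustration only, not the Ansible playbook the benchmark actually runs; the function names `port_responds`, `file_contains`, and `ground_truth_passed` are hypothetical.

```python
import socket
from pathlib import Path


def port_responds(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def file_contains(path: str, expected: str) -> bool:
    """Return True if the file exists and contains the expected text."""
    p = Path(path)
    return p.is_file() and expected in p.read_text(encoding="utf-8", errors="replace")


def ground_truth_passed(checks: list[bool]) -> bool:
    """Assumed policy: a scenario passes only if at least one check ran and all passed."""
    return bool(checks) and all(checks)
```

The key property is that every check looks at the final state of the host and ignores how the agent reached it, which is what keeps pass rate independent of compliance.
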
Compliance
Verifies the agent did it the way you expected.

- Which actions were taken
- Which tools were used (and which weren’t)
- Specific words in the task summary
- Which operational rules were injected
- How many tasks the job needed
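
The five compliance dimensions listed above can be pictured as a comparison between a record of the run and a set of expectations. The following is a rough sketch under assumed data shapes: `RunRecord`, `Expectations`, and `compliance_failures` are hypothetical names invented for this example and do not reflect the product's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical summary of what the agent did during one scenario run."""
    actions: list[str]
    tools_used: set[str]
    summary: str
    rules_injected: set[str]
    task_count: int


@dataclass
class Expectations:
    """Hypothetical compliance expectations for a scenario."""
    required_actions: list[str] = field(default_factory=list)
    required_tools: set[str] = field(default_factory=set)
    forbidden_tools: set[str] = field(default_factory=set)
    summary_keywords: list[str] = field(default_factory=list)
    required_rules: set[str] = field(default_factory=set)
    max_tasks: Optional[int] = None


def compliance_failures(run: RunRecord, exp: Expectations) -> list[str]:
    """Return one message per violated expectation; an empty list means compliant."""
    failures = []
    for action in exp.required_actions:
        if action not in run.actions:
            failures.append(f"missing action: {action}")
    for tool in sorted(exp.required_tools - run.tools_used):
        failures.append(f"tool not used: {tool}")
    for tool in sorted(exp.forbidden_tools & run.tools_used):
        failures.append(f"forbidden tool used: {tool}")
    lowered = run.summary.lower()
    for word in exp.summary_keywords:
        if word.lower() not in lowered:
            failures.append(f"summary missing keyword: {word}")
    for rule in sorted(exp.required_rules - run.rules_injected):
        failures.append(f"rule not injected: {rule}")
    if exp.max_tasks is not None and run.task_count > exp.max_tasks:
        failures.append(f"too many tasks: {run.task_count} > {exp.max_tasks}")
    return failures
```

Returning a list of failure messages rather than a single boolean makes it easier to see which part of the process the agent skipped.
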
Automation
The benchmark runner can execute scenarios on a regular schedule. Results land in Command Center → Benchmarks with pass rate, compliance, and trends over time, so you can spot regressions early. Recommended cadence: a weekly run of your full scenario suite (scheduled overnight), plus per-PR runs of the scenarios that touch changed specialties or rules.

What benchmarking is good for

| Use case | Why |
|---|---|
| Validating a new ticket typology before exposing it to production | Replicate the failure mode and see how the agent handles it |
| Comparing two specialties side by side | Same scenario, different specialty: see which one passes faster and more cleanly |
| Catching regressions after a rule edit | Trend lines surface failures the moment a previously passing scenario starts failing |
| Tuning model choice (Sonnet vs Opus vs your own) | Run the same scenario with different --main-engine overrides |
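
The regression-catching use case in the table above comes down to comparing each scenario's latest result with its previous one. Here is a minimal sketch, assuming results are available as a mapping from scenario name to a list of pass/fail booleans ordered oldest first; that shape and the names `pass_rate` and `find_regressions` are assumptions for illustration, not the benchmark runner's actual output format.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of runs that passed; 0.0 when there are no runs."""
    return sum(results) / len(results) if results else 0.0


def find_regressions(history: dict[str, list[bool]]) -> list[str]:
    """Return scenarios whose latest run failed right after a passing run.

    history maps scenario name -> pass/fail results, oldest first.
    """
    regressed = []
    for name, results in history.items():
        if len(results) >= 2 and results[-2] and not results[-1]:
            regressed.append(name)
    return sorted(regressed)
```

A scenario that has always failed is deliberately not reported here: it is a known gap, not a regression introduced by the latest change.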

