When a new ticket typology arrives, you don’t have to test it in production. Replicate it with Ansible playbooks on a sandbox host and evaluate the agent there.

What gets measured

Each scenario run produces two independent scores:
Score | Question
Pass rate | Did the agent actually fix the problem?
Compliance | Did the agent follow your process?
A scenario passes only when both gates pass. An agent can fix a problem the wrong way, or follow every rule and still leave the system broken — the split catches both.
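The two-gate rule can be sketched in a few lines of Python. This is only an illustration: `ScenarioResult`, its fields, and `diagnose` are hypothetical names, not part of the benchmark runner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    """Hypothetical record of one scenario run; field names are illustrative."""
    fixed: bool      # pass-rate gate: did the agent actually fix the problem?
    compliant: bool  # compliance gate: did the agent follow your process?

    @property
    def passed(self) -> bool:
        # The scenario passes only when both gates pass.
        return self.fixed and self.compliant


def diagnose(result: ScenarioResult) -> str:
    """Name the failure mode that the two-score split exposes."""
    if result.passed:
        return "pass"
    if result.fixed:
        return "fixed the problem the wrong way"
    if result.compliant:
        return "followed the process but left the system broken"
    return "neither fixed nor compliant"
```

Keeping the two scores separate means a failed scenario already tells you which half to look at: the agent's result or the agent's method.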

Pass rate

Verifies the end result, regardless of how the agent got there.
  • Ansible-based ground-truth check: after the agent finishes, an Ansible playbook inspects the host (service status, file contents, port responsiveness). This is the most reliable check.
  • Output-based check: if there is nothing to verify on the host (e.g. pure investigation tickets), measure the quality of the report instead.
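As a rough illustration of the ground-truth idea, the Python sketch below compares the state a scenario expects with the state observed on the host. It assumes the facts have already been collected into a plain dictionary; in the product that inspection is done by an Ansible playbook. The function name and the key format (`service:nginx`, `port:443`) are hypothetical.

```python
_MISSING = object()  # sentinel so a missing fact never matches an expected None


def ground_truth_check(expected: dict, observed: dict) -> list:
    """Return a list of mismatches; an empty list means the end state is correct.

    Both arguments are hypothetical fact dictionaries, for example
    {"service:nginx": "running", "port:443": "open"}.
    """
    failures = []
    for key, want in expected.items():
        got = observed.get(key, _MISSING)
        if got is _MISSING:
            failures.append(f"{key}: expected {want!r}, but no fact was collected")
        elif got != want:
            failures.append(f"{key}: expected {want!r}, got {got!r}")
    return failures


# Usage: the agent restarted the service, but the port is still closed.
expected = {"service:nginx": "running", "port:443": "open"}
observed = {"service:nginx": "running", "port:443": "closed"}
failures = ground_truth_check(expected, observed)
```

The check looks only at the final host state, so it gives the same verdict whichever path the agent took to get there.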

Compliance

Verifies the agent did it the way you expected.
  • Which actions were taken
  • Which tools were used (and which weren’t)
  • Specific words in the task summary
  • Which operational rules were injected
  • How many tasks the job needed
Compliance lets auditors confirm agents follow company practice — useful for regulated industries.
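The compliance dimensions listed above can be pictured as a set of checks over a record of the run. The Python sketch below is one possible shape, not the runner's validator format: `RunRecord`, `CompliancePolicy`, and every field name are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class RunRecord:
    """Hypothetical trace of one agent job."""
    actions: List[str]
    tools_used: Set[str]
    summary: str
    rules_injected: Set[str]
    task_count: int


@dataclass
class CompliancePolicy:
    """Hypothetical expectations, one field per compliance dimension."""
    required_actions: List[str] = field(default_factory=list)
    required_tools: Set[str] = field(default_factory=set)
    forbidden_tools: Set[str] = field(default_factory=set)
    summary_keywords: List[str] = field(default_factory=list)
    required_rules: Set[str] = field(default_factory=set)
    max_tasks: Optional[int] = None


def check_compliance(run: RunRecord, policy: CompliancePolicy) -> List[str]:
    """Return every violation found; an empty list means the run is compliant."""
    violations = []
    for action in policy.required_actions:
        if action not in run.actions:
            violations.append(f"missing action: {action}")
    for tool in sorted(policy.required_tools - run.tools_used):
        violations.append(f"required tool not used: {tool}")
    for tool in sorted(policy.forbidden_tools & run.tools_used):
        violations.append(f"forbidden tool used: {tool}")
    summary = run.summary.lower()
    for word in policy.summary_keywords:
        if word.lower() not in summary:  # case-insensitive substring match
            violations.append(f"summary missing keyword: {word}")
    for rule in sorted(policy.required_rules - run.rules_injected):
        violations.append(f"rule not injected: {rule}")
    if policy.max_tasks is not None and run.task_count > policy.max_tasks:
        violations.append(f"too many tasks: {run.task_count} > {policy.max_tasks}")
    return violations
```

Returning the full list of violations, rather than stopping at the first one, gives an auditor the complete picture of a single run.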

Automation

The benchmark runner can execute scenarios on a regular schedule. Results land in Command Center → Benchmarks with pass rate, compliance, and trends over time, so you can spot regressions early. Recommended cadence: a nightly or weekly run of your full scenario suite, plus per-PR runs of the scenarios that touch changed specialties or rules.
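For the per-PR runs, the selection step amounts to a filter over the scenario suite. The Python sketch below assumes each scenario declares the specialties and rules it exercises; the `Scenario` type and `select_for_pr` are hypothetical and do not describe the runner's actual scenario format.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario metadata: what the scenario exercises."""
    name: str
    specialties: FrozenSet[str]
    rules: FrozenSet[str]


def select_for_pr(
    scenarios: Iterable[Scenario],
    changed_specialties: Iterable[str],
    changed_rules: Iterable[str],
) -> List[Scenario]:
    """Keep only scenarios that touch a changed specialty or a changed rule."""
    changed_specialties = set(changed_specialties)
    changed_rules = set(changed_rules)
    return [
        s for s in scenarios
        if s.specialties & changed_specialties or s.rules & changed_rules
    ]


# Usage: a PR edits one rule, so only the scenarios that depend on it are run.
suite = [
    Scenario("disk-full", frozenset({"linux"}), frozenset({"backup-first"})),
    Scenario("cert-expired", frozenset({"web"}), frozenset({"change-window"})),
    Scenario("dns-outage", frozenset({"network"}), frozenset()),
]
selected = select_for_pr(suite, changed_specialties=[], changed_rules=["change-window"])
```

The scheduled run of the full suite still matters, because it catches regressions in scenarios that this filter skips.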

What benchmarking is good for

Use case | Why
Validating a new ticket typology before exposing it to production | Replicate the failure mode and see how the agent handles it
Comparing two specialties side by side | Same scenario, different specialty; see which passes faster and more cleanly
Catching regressions after a rule edit | Trend lines surface failures as soon as a previously passing scenario starts failing
Tuning models (Sonnet vs. Opus vs. your own) | Same scenario across multiple --main-engine overrides
See Benchmark for the runner CLI, scenario format, and validators.