Benchmarking for Risk

When a new ticket typology arrives, you don’t have to test it in production. Replicate it with Ansible playbooks on a sandbox host and evaluate the agent there.

What gets measured

Each scenario run produces two independent scores:

Score	Question
Pass rate	Did the agent actually fix the problem?
Compliance	Did the agent follow your process?

A scenario passes only when both gates pass. An agent can fix a problem the wrong way, or follow every rule and still leave the system broken — the split catches both.
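The two-gate rule can be sketched in a few lines of Python. The `ScenarioResult` type, its field names, and the `summarize` helper are hypothetical and chosen for illustration only; they are not the runner's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    """Hypothetical record of one scenario run (illustrative names only)."""
    name: str
    fixed: bool      # pass-rate gate: did the agent actually fix the problem?
    compliant: bool  # compliance gate: did the agent follow the process?

    @property
    def passed(self) -> bool:
        # A scenario passes only when both gates pass.
        return self.fixed and self.compliant


def summarize(results: list[ScenarioResult]) -> dict[str, float]:
    """Return each score, and the combined result, as a fraction of all runs."""
    total = len(results)
    if total == 0:
        return {"pass_rate": 0.0, "compliance": 0.0, "both": 0.0}
    return {
        "pass_rate": sum(r.fixed for r in results) / total,
        "compliance": sum(r.compliant for r in results) / total,
        "both": sum(r.passed for r in results) / total,
    }
```

Keeping the two booleans separate, rather than storing only the combined verdict, is what makes it possible to tell "fixed the wrong way" apart from "followed the rules but left the system broken".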

Pass rate

Verifies the end result, regardless of how the agent got there.

Ansible-based ground-truth check: after the agent finishes, an Ansible playbook inspects the host (service status, file contents, port responsiveness). This is the most reliable option.
Output-based check: if there is nothing to verify on the host (e.g. for pure investigation tickets), the check measures the quality of the agent's report instead.
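In the product, the host inspection is done by an Ansible playbook. The Python below is only a rough sketch of the kind of ground-truth probes such a check performs (port responsiveness and file contents). The function names are hypothetical and not part of any runner API.

```python
import socket
from pathlib import Path


def port_responds(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def file_contains(path: str, expected: str) -> bool:
    """Return True if the file is readable and contains the expected text."""
    try:
        return expected in Path(path).read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError):
        return False
```

Both probes look only at the end state of the host, so they give the same answer regardless of which steps the agent took to get there.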

Compliance

Verifies the agent did it the way you expected.

Which actions were taken
Which tools were used (and which weren’t)
Specific words in the task summary
Which operational rules were injected
How many tasks the job needed
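The five checks listed above could be expressed as a single validation pass over a record of what the agent did. This is a minimal sketch under stated assumptions: `RunRecord`, `CompliancePolicy`, and every field name are hypothetical, and the real validators may be configured quite differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical trace of one job (illustrative field names)."""
    actions: list[str]
    tools_used: list[str]
    summary: str
    rules_injected: list[str]
    task_count: int


@dataclass
class CompliancePolicy:
    """Hypothetical expectations for one scenario."""
    required_actions: set[str] = field(default_factory=set)
    forbidden_tools: set[str] = field(default_factory=set)
    required_words: set[str] = field(default_factory=set)
    required_rules: set[str] = field(default_factory=set)
    max_tasks: Optional[int] = None


def compliance_violations(run: RunRecord, policy: CompliancePolicy) -> list[str]:
    """Return a list of violations; an empty list means the run is compliant."""
    violations = []
    for action in sorted(policy.required_actions - set(run.actions)):
        violations.append(f"missing action: {action}")
    for tool in sorted(policy.forbidden_tools & set(run.tools_used)):
        violations.append(f"forbidden tool used: {tool}")
    summary = run.summary.lower()
    for word in sorted(policy.required_words):
        if word.lower() not in summary:  # case-insensitive substring match
            violations.append(f"summary lacks word: {word}")
    for rule in sorted(policy.required_rules - set(run.rules_injected)):
        violations.append(f"rule not injected: {rule}")
    if policy.max_tasks is not None and run.task_count > policy.max_tasks:
        violations.append(f"too many tasks: {run.task_count} > {policy.max_tasks}")
    return violations
```

Returning the individual violations rather than a single boolean gives an auditor the reason a run failed compliance, not just the verdict.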

Compliance lets auditors confirm agents follow company practice — useful for regulated industries.

Automation

The benchmark runner can execute scenarios on a regular schedule. Results land in Command Center → Benchmarks with pass rate, compliance, and trends over time, so you can spot regressions early. Recommended cadence: a nightly or weekly run of your full scenario suite, plus per-PR runs of the scenarios that touch changed specialties or rules.
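Selecting the per-PR subset amounts to intersecting each scenario's dependencies with the set of changed specialties and rules. The sketch below assumes each scenario declares which specialties and rules it exercises. The `Scenario` type and `scenarios_for_change` are hypothetical names, not part of the runner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario descriptor (illustrative field names)."""
    name: str
    specialties: frozenset[str]
    rules: frozenset[str]


def scenarios_for_change(
    scenarios: list[Scenario],
    changed_specialties: set[str],
    changed_rules: set[str],
) -> list[Scenario]:
    """Return the scenarios that touch any changed specialty or rule."""
    return [
        s
        for s in scenarios
        if s.specialties & changed_specialties or s.rules & changed_rules
    ]
```

The full suite still runs on the schedule, so a scenario missed by this filter is caught at the next scheduled run rather than never.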

What benchmarking is good for

Use case	Why
Validating a new ticket typology before exposing to prod	Replicate the failure mode, see how the agent handles it
Comparing two specialties side-by-side	Same scenario, different specialty; see which one passes faster and more cleanly
Catching regressions after a rule edit	Trend lines surface failures the moment a previously-passing scenario starts failing
Tuning LLM models — Sonnet vs Opus vs your own	Same scenario across multiple `--main-engine` overrides
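For the regression use case, the underlying comparison is simple: a regression is a scenario that passed in the previous run and fails in the current one. The sketch below assumes results are available as a mapping from scenario name to a pass/fail boolean. That shape and the function name are assumptions for illustration, not the Command Center's export format.

```python
def find_regressions(
    previous: dict[str, bool],
    current: dict[str, bool],
) -> list[str]:
    """Return names of scenarios that passed previously and fail now.

    Scenarios that are new in `current`, or that were already failing,
    are not counted as regressions.
    """
    return sorted(
        name
        for name, passed in current.items()
        if not passed and previous.get(name) is True
    )
```

The same comparison works for engine tuning: treat the results from one engine as `previous` and those from another as `current` to list the scenarios the second engine loses.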

See Benchmark for the runner CLI, scenario format, and validators.
