Why benchmark
Validate new ticket typologies
Before letting a new specialty handle real tickets, replicate the failure mode in a sandbox and see how the agent handles it across dozens of runs.
Catch regressions
A previously-passing scenario starts failing after a specialty edit. Trend lines surface it immediately.
Compare models or prompts
Same scenario, different `--main-engine`. Same scenario, different specialty. The compliance score tells you which behaves better, not just which is faster.
Prove compliance to your team
Auditors get a record: which actions the agent took, which operational rules it followed, which commands it avoided.
The two scores
Every scenario produces two independent scores.

| Score | Question | How it’s measured |
|---|---|---|
| Pass rate | Did the agent actually fix the problem? | Ground-truth check against the host’s state after the agent finishes (typically an Ansible playbook). |
| Compliance | Did the agent follow your process? | Pattern-matching against executed commands, task summaries, the agent’s reasoning, injected rules. |
- An agent can fix a problem in a way that violates your processes — passes task validation, fails compliance.
- An agent can follow every right step and still leave the system broken — passes compliance, fails task validation.
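To make the independence of the two scores concrete, here is a minimal Python sketch that aggregates a handful of runs into a pass rate and a compliance rate. Everything in it is an assumption made for illustration: the `RunResult` shape, the `REQUIRED` and `FORBIDDEN` patterns, and the idea that compliance is a yes/no verdict per run. The runner's real schema and scoring may differ, and this sketch only pattern-matches executed commands.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """One benchmark run (hypothetical shape, not the runner's real schema)."""
    task_fixed: bool  # outcome of the ground-truth check on the host
    commands: list = field(default_factory=list)  # commands the agent executed


# Hypothetical process rules, expressed as regular expressions.
REQUIRED = [r"^systemctl status\b"]          # must inspect before acting
FORBIDDEN = [r"\brm -rf\b", r"^reboot\b"]    # commands the process bans


def is_compliant(run: RunResult) -> bool:
    """True if every required pattern appears and no forbidden one does."""
    has_required = all(
        any(re.search(p, c) for c in run.commands) for p in REQUIRED
    )
    has_forbidden = any(
        re.search(p, c) for c in run.commands for p in FORBIDDEN
    )
    return has_required and not has_forbidden


def score(runs):
    """Return (pass_rate, compliance_rate) as fractions of all runs."""
    if not runs:
        raise ValueError("no runs to score")
    n = len(runs)
    pass_rate = sum(r.task_fixed for r in runs) / n
    compliance = sum(is_compliant(r) for r in runs) / n
    return pass_rate, compliance


runs = [
    # Fixed the problem and followed the process.
    RunResult(True, ["systemctl status nginx", "systemctl restart nginx"]),
    # Fixed the problem, but used a banned command.
    RunResult(True, ["rm -rf /var/cache/nginx", "systemctl status nginx"]),
    # Followed the process, but the host is still broken.
    RunResult(False, ["systemctl status nginx"]),
    # Neither fixed nor compliant.
    RunResult(False, ["reboot"]),
]

pass_rate, compliance = score(runs)
print(f"pass rate: {pass_rate:.0%}, compliance: {compliance:.0%}")
```

The second and third runs are the two cases described in the bullets above: one counts toward pass rate only, the other toward compliance only.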
How a scenario runs
Where benchmarks run
| Mode | When to use |
|---|---|
| `--mode host` (default) | Pre-provisioned VMs you maintain. Simplest setup, most production-like. |
| `--mode lima` | macOS sandbox. The runner provisions a fresh Lima VM per run, destroys it after. |
| `--mode incus` | Linux sandbox. Same idea as lima but with Incus. |

The sandbox modes (`lima`, `incus`) support `--parallel` for throughput and guarantee that every run starts from an identical, known-good state. See VM Sandbox for setup.
Prerequisites
- A running 2501 instance reachable from the runner machine (the runner connects to the database via `DATABASE_URL` and optionally to a gateway or the engine API).
- Ansible installed and on PATH; playbooks drive prepare / validate / restore.
- SSH access from the runner machine to your sandbox hosts (or Lima/Incus installed for VM modes).
- Scenarios in a directory the runner can find (auto-detected, or set with `-p`).
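
The checklist above can be turned into a quick preflight script. The sketch below is not part of the 2501 tooling. It assumes `DATABASE_URL` is read from the environment, that Ansible is detected through its `ansible-playbook` binary, and that the `ssh`, `limactl`, and `incus` binaries are the right things to look for in each mode. The `preflight` function and `MODE_BINARIES` table are hypothetical names.

```python
import os
import shutil
from typing import Callable, List, Mapping, Optional

# Binary each mode is assumed to need, in addition to Ansible.
MODE_BINARIES = {"host": "ssh", "lima": "limactl", "incus": "incus"}


def preflight(
    mode: str = "host",
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return a list of human-readable problems; empty means ready to run."""
    if mode not in MODE_BINARIES:
        return [f"unknown mode: {mode!r}"]

    problems = []
    # The runner connects to the database via DATABASE_URL.
    if not env.get("DATABASE_URL"):
        problems.append("DATABASE_URL is not set")
    # Playbooks drive prepare / validate / restore, so Ansible must be on PATH.
    for binary in ("ansible-playbook", MODE_BINARIES[mode]):
        if which(binary) is None:
            problems.append(f"{binary} not found on PATH")
    return problems
```

Calling `preflight("lima")` on the runner machine returns an empty list when nothing is missing. The `env` and `which` parameters exist only so the check can be exercised without touching the real environment. The script does not verify that the 2501 instance or the sandbox hosts are actually reachable.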
What to read next
Quickstart
Run your first scenario in 5 minutes.
Authoring scenarios
Write `scenario.json`, hosts, agents, playbooks, validators.
Running
The `2501 runner start / validate / flush` reference.
VM Sandbox
Ephemeral VMs for fully reproducible runs.

