Production is the wrong place to test a new agent. Benchmark runs your agents through realistic, reproducible scenarios in a sandbox — a broken nginx, a CrashLoopBackOff pod, a filling disk — and gives you two scores: did the agent fix it, and did it fix it the right way.

Why benchmark

Validate new ticket typologies

Before letting a new specialty handle real tickets, replicate the failure mode in a sandbox and see how the agent handles it across dozens of runs.

Catch regressions

When a previously passing scenario starts failing after a specialty edit, trend lines surface the regression.

Compare models or prompts

Same scenario, different --main-engine. Same scenario, different specialty. The compliance score tells you which behaves better, not just which is faster.

Prove compliance to your team

Auditors get a record: which actions the agent took, which operational rules it followed, which commands it avoided.

The two scores

Every scenario produces two independent scores.
| Score | Question | How it’s measured |
| --- | --- | --- |
| Pass rate | Did the agent actually fix the problem? | Ground-truth check against the host’s state after the agent finishes (typically an Ansible playbook). |
| Compliance | Did the agent follow your process? | Pattern-matching against executed commands, task summaries, the agent’s reasoning, injected rules. |
A scenario passes only when both gates pass. The split is intentional:
  • An agent can fix a problem in a way that violates your processes — passes task validation, fails compliance.
  • An agent can follow every right step and still leave the system broken — passes compliance, fails task validation.
You want both green.
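
The two gates can be sketched in a few lines of Python. This is an illustration only, not 2501's scoring code: the `RunResult` type, the function names, and the required/forbidden regex lists are hypothetical. The compliance check here looks only at executed commands, whereas the real one also considers task summaries, the agent's reasoning, and injected rules.

```python
import re
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical record of one scenario run (not a 2501 type)."""
    task_validated: bool  # ground-truth check on the host's final state
    commands: list        # commands the agent executed, as strings


def is_compliant(commands, required, forbidden):
    """Pattern-match executed commands against illustrative rules."""
    joined = "\n".join(commands)
    if any(re.search(pattern, joined) for pattern in forbidden):
        return False
    return all(re.search(pattern, joined) for pattern in required)


def scenario_passes(run, required, forbidden):
    """A run passes only when both gates pass."""
    return run.task_validated and is_compliant(run.commands, required, forbidden)


def pass_rate(runs):
    """Fraction of runs whose task validation succeeded."""
    return sum(r.task_validated for r in runs) / len(runs) if runs else 0.0


# Example rules (assumed): test the config before reloading, never force-kill.
required = [r"nginx -t"]
forbidden = [r"\brm -rf\b", r"\bkill -9\b"]

fixed_badly = RunResult(True, ["kill -9 1234", "systemctl start nginx"])
fixed_well = RunResult(True, ["nginx -t", "systemctl reload nginx"])
not_fixed = RunResult(False, ["nginx -t", "systemctl reload nginx"])
```

In this sketch, `fixed_badly` passes task validation but fails compliance, `not_fixed` is compliant but leaves the system broken, and only `fixed_well` passes the scenario.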

How a scenario runs

Pre-flight ─▶ Provision ─▶ Prepare ─▶ Execute ─▶ Validate ─▶ Restore ─▶ Report
   │            │            │           │           │           │           │
   verify       resolve      run         dispatch    score the   reset the   write
   env + DB     host +       prepare.yml ticket to   run         host or     ScenarioReport
                agent        (introduce  the agent              tear down   to the DB
                             the failure)                        the VM
The runner mirrors how real tickets flow through 2501 — same gateway, same agent, same orchestrator — but with a controlled environment and a scoring harness around it. See Playbooks for the execution diagram in full.
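
As a rough mental model, the phases in the diagram can be read as an ordered pipeline. The sketch below is a loose illustration under stated assumptions: the `Phase` enum mirrors the diagram, but `run_scenario`, the handler mapping, and the choice to stop at the first failing phase are hypothetical and do not describe documented runner behavior.

```python
from enum import Enum
from typing import Callable, Dict, List


class Phase(Enum):
    PRE_FLIGHT = "pre-flight"  # verify env + DB
    PROVISION = "provision"    # resolve host + agent
    PREPARE = "prepare"        # run prepare.yml (introduce the failure)
    EXECUTE = "execute"        # dispatch ticket to the agent
    VALIDATE = "validate"      # score the run
    RESTORE = "restore"        # reset the host or tear down the VM
    REPORT = "report"          # write ScenarioReport to the DB


def run_scenario(handlers: Dict[Phase, Callable[[], bool]]) -> List[Phase]:
    """Run phases in diagram order and return the ones that completed.

    Stopping at the first failing phase is an assumption of this sketch.
    """
    completed = []
    for phase in Phase:  # Enum iteration preserves definition order
        if not handlers[phase]():
            break
        completed.append(phase)
    return completed


# Hypothetical handlers: every phase succeeds except Execute.
handlers = {phase: (lambda: True) for phase in Phase}
handlers[Phase.EXECUTE] = lambda: False
completed = run_scenario(handlers)
```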

Where benchmarks run

| Mode | When to use |
| --- | --- |
| --mode host (default) | Pre-provisioned VMs you maintain. Simplest setup, most production-like. |
| --mode lima | macOS sandbox. The runner provisions a fresh Lima VM per run, destroys it after. |
| --mode incus | Linux sandbox. Same idea as lima but with Incus. |
Both sandbox modes (lima and incus) let you run scenarios in parallel with --parallel for throughput, and guarantee that every run starts from an identical, known-good state. See VM Sandbox for setup.
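
The mode choice can be summarized as a small decision function. This is a sketch of the guidance above, not part of the runner: `suggest_mode` and its `ephemeral` parameter are hypothetical names, and raising an error on other platforms is an assumption, since no sandbox mode is listed for them here.

```python
import platform


def suggest_mode(ephemeral: bool, system: str = "") -> str:
    """Suggest a --mode value from the guidance in the table above."""
    system = system or platform.system()  # "Darwin" on macOS, "Linux" on Linux
    if not ephemeral:
        return "host"   # pre-provisioned VMs you maintain (the default)
    if system == "Darwin":
        return "lima"   # fresh Lima VM per run
    if system == "Linux":
        return "incus"  # same idea, with Incus
    # Assumption: no sandbox mode is listed for other platforms.
    raise ValueError(f"no sandbox mode listed for platform {system!r}")
```

For example, `suggest_mode(ephemeral=True, system="Darwin")` returns `"lima"`, which would be passed as `--mode lima`.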

Prerequisites

  • A running 2501 instance reachable from the runner machine (the runner connects to the database via DATABASE_URL and optionally to a gateway or the engine API).
  • Ansible installed and on PATH — playbooks drive prepare / validate / restore.
  • SSH access from the runner machine to your sandbox hosts (or Lima/Incus installed for VM modes).
  • Scenarios in a directory the runner can find (auto-detected, or set with -p).
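
Before a first run, the list above can be checked with a short script. This is a minimal sketch, assuming the standard binary names `ansible-playbook`, `ssh`, `limactl`, and `incus`; the `check_prerequisites` function is hypothetical and not part of 2501. It only confirms that the tools are on PATH and that DATABASE_URL is set. It does not verify that the database is reachable or that SSH login to a host actually works.

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which, mode="host"):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("DATABASE_URL"):
        problems.append("DATABASE_URL is not set")
    if which("ansible-playbook") is None:
        problems.append("ansible-playbook not found on PATH")
    # Host mode needs SSH; the VM modes need their own tool instead.
    tool = {"host": "ssh", "lima": "limactl", "incus": "incus"}[mode]
    if which(tool) is None:
        problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing:", problem)
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real environment.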

Quickstart

Run your first scenario in 5 minutes.

Authoring scenarios

Write scenario.json, hosts, agents, playbooks, validators.

Running

The 2501 runner start / validate / flush reference.

VM Sandbox

Ephemeral VMs for fully reproducible runs.