> ## Documentation Index
> Fetch the complete documentation index at: https://docs.2501.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Why Benchmark

> Validate agent behavior on realistic scenarios before exposing it to production

Production is the wrong place to test a new agent. **Benchmark** runs your agents through realistic, reproducible scenarios in a sandbox (a broken nginx, a CrashLoopBackOff pod, a filling disk) and gives you two scores: did the agent **fix it**, and did it **fix it the right way**.

## Why benchmark

* Before letting a new specialty handle real tickets, replicate the failure mode in a sandbox and see how the agent handles it across dozens of runs.
* A previously-passing scenario starts failing after a specialty edit. Trend lines surface it immediately.
* Same scenario, different `--main-engine`. Same scenario, different specialty. The compliance score tells you which behaves better, not just which is faster.
* Auditors get a record: which actions the agent took, which operational rules it followed, which commands it avoided.

## The two scores

Every scenario produces two **independent** scores.

| Score          | Question                                | How it's measured                                                                                     |
| -------------- | --------------------------------------- | ----------------------------------------------------------------------------------------------------- |
| **Pass rate**  | Did the agent actually fix the problem? | Ground-truth check against the host's state after the agent finishes (typically an Ansible playbook). |
| **Compliance** | Did the agent follow your process?      | Pattern-matching against executed commands, task summaries, the agent's reasoning, injected rules.    |

A scenario **passes only when both gates pass**. The split is intentional:

* An agent can fix a problem in a way that **violates your processes**: passes task validation, fails compliance.
* An agent can follow every right step and **still leave the system broken**: passes compliance, fails task validation.

You want both green.

## How a scenario runs

```
Pre-flight ─▶ Provision ──▶ Prepare ────▶ Execute ────▶ Validate ───▶ Restore ────▶ Report
│             │             │             │             │             │             │
verify        resolve       run           dispatch      score the     reset the     write
env + DB      host +        prepare.yml   ticket to     run           host or       ScenarioReport
              agent         (introduce    the agent                   tear down     to the DB
                            the failure)                              the VM
```

The runner mirrors how real tickets flow through 2501 (same gateway, same agent, same orchestrator) but with a controlled environment and a scoring harness around it. See [Playbooks](/0.8/benchmark/playbooks) for the execution diagram in full.

## Where benchmarks run

| Mode                        | When to use                                                                      |
| --------------------------- | -------------------------------------------------------------------------------- |
| **`--mode host`** (default) | Pre-provisioned VMs you maintain. Simplest setup, most production-like.          |
| **`--mode lima`**           | macOS sandbox. The runner provisions a fresh Lima VM per run, destroys it after. |
| **`--mode incus`**          | Linux sandbox. Same idea as lima but with Incus.                                 |

Both VM modes let you run scenarios `--parallel` for throughput, and guarantee every run starts from an identical, known-good state. See [VM Sandbox](/0.8/benchmark/sandbox) for setup.

## Prerequisites

* **A running 2501 instance** reachable from the runner machine (the runner connects to the database via `DATABASE_URL` and optionally to a gateway or the engine API).
* **Ansible** installed and on PATH; playbooks drive prepare / validate / restore.
* **SSH access** from the runner machine to your sandbox hosts (or Lima/Incus installed for VM modes).
* **Scenarios** in a directory the runner can find (auto-detected, or set with `-p`).

## What to read next

* Run your first scenario in 5 minutes.
* Write `scenario.json`, hosts, agents, playbooks, validators.
* The `2501 runner start / validate / flush` reference.
* Ephemeral VMs for fully reproducible runs.
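
As a recap of the two-gate rule described under "The two scores", here is a minimal Python sketch of how two independent gates could combine into a per-run verdict and be aggregated across runs. Everything in it is an assumption for illustration: `RunResult`, `summarize`, and the boolean treatment of compliance are hypothetical and are not the 2501 `ScenarioReport` schema or the runner's actual aggregation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Outcome of one benchmark run (hypothetical shape, not the 2501 schema)."""
    task_valid: bool  # assumed: result of the ground-truth check on the host
    compliant: bool   # assumed: result of the compliance pattern-matching

    @property
    def passed(self) -> bool:
        # A run passes only when both gates pass.
        return self.task_valid and self.compliant


def summarize(runs: list[RunResult]) -> dict[str, float]:
    """Aggregate each gate separately, plus the combined verdict, over all runs."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    return {
        "pass_rate": sum(r.task_valid for r in runs) / n,
        "compliance": sum(r.compliant for r in runs) / n,
        "both_gates": sum(r.passed for r in runs) / n,
    }


runs = [
    RunResult(task_valid=True, compliant=True),   # fixed it, the right way
    RunResult(task_valid=True, compliant=False),  # fixed it, violated process
    RunResult(task_valid=False, compliant=True),  # followed process, still broken
    RunResult(task_valid=True, compliant=True),
]
summary = summarize(runs)
print(summary)
# {'pass_rate': 0.75, 'compliance': 0.75, 'both_gates': 0.5}
```

Keeping the two gates as separate fields, rather than collapsing them into one flag, preserves the distinction the page draws: the second and third runs both fail overall, but for opposite reasons.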