# Why Benchmark

> Validate agent behavior on realistic scenarios before exposing it to production

Production is the wrong place to test a new agent. **Benchmark** runs your agents through realistic, reproducible scenarios in a sandbox — a broken nginx, a CrashLoopBackOff pod, a filling disk — and gives you two scores: did the agent **fix it**, and did it **fix it the right way**.

## Why benchmark

<CardGroup cols={2}>
  <Card title="Validate new ticket typologies" icon="vial">
    Before letting a new specialty handle real tickets, replicate the failure mode in a sandbox and see how the agent handles it across dozens of runs.
  </Card>

  <Card title="Catch regressions" icon="chart-line">
    When a previously passing scenario starts failing after a specialty edit, trend lines surface it immediately.
  </Card>

  <Card title="Compare models or prompts" icon="scale-balanced">
    Same scenario, different `--main-engine`. Same scenario, different specialty. The compliance score tells you which behaves better, not just which is faster.
  </Card>

  <Card title="Prove compliance to your team" icon="clipboard-check">
    Auditors get a record: which actions the agent took, which operational rules it followed, which commands it avoided.
  </Card>
</CardGroup>

## The two scores

Every scenario produces two **independent** scores.

| Score          | Question                                | How it's measured                                                                                     |
| -------------- | --------------------------------------- | ----------------------------------------------------------------------------------------------------- |
| **Pass rate**  | Did the agent actually fix the problem? | Ground-truth check against the host's state after the agent finishes (typically an Ansible playbook). |
| **Compliance** | Did the agent follow your process?      | Pattern-matching against executed commands, task summaries, the agent's reasoning, injected rules.    |

A scenario **passes only when both gates pass**: task validation (the ground-truth check behind the pass rate) and compliance. The split is intentional:

* An agent can fix a problem in a way that **violates your processes**: it passes task validation but fails compliance.
* An agent can follow every right step and **still leave the system broken**: it passes compliance but fails task validation.

You want both green.
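
The two-gate rule can be sketched in a few lines of Python. This is an illustration only: `RunResult`, `compliance_ok`, `scenario_passed`, and the forbidden-pattern rule are hypothetical names and not part of the 2501 runner, and real compliance checks also cover task summaries, the agent's reasoning, and injected rules rather than commands alone.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Hypothetical record of one scenario run (not the runner's real schema)."""
    task_validated: bool                                # ground-truth check on host state
    commands: list[str] = field(default_factory=list)   # commands the agent executed


def compliance_ok(commands: list[str], forbidden: list[str]) -> bool:
    """Return True when no executed command matches a forbidden regex."""
    return not any(re.search(p, c) for p in forbidden for c in commands)


def scenario_passed(run: RunResult, forbidden: list[str]) -> bool:
    """A scenario passes only when both gates pass."""
    return run.task_validated and compliance_ok(run.commands, forbidden)


# Illustrative rule (an assumption): the agent must not force-kill processes.
FORBIDDEN = [r"\bkill\s+-9\b"]

fixed_wrong_way = RunResult(True, ["kill -9 1234", "systemctl start nginx"])
followed_rules_still_broken = RunResult(False, ["systemctl restart nginx"])
both_green = RunResult(True, ["nginx -t", "systemctl reload nginx"])
```

Only `both_green` passes here: the first run fixes the service but trips the compliance rule, and the second follows the rules but fails the ground-truth check.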

## How a scenario runs

```
Pre-flight ─▶ Provision ─▶ Prepare ─▶ Execute ─▶ Validate ─▶ Restore ─▶ Report
   │            │            │           │           │           │           │
   verify       resolve      run         dispatch    score the   reset the   write
   env + DB     host +       prepare.yml ticket to   run         host or     ScenarioReport
                agent        (introduce  the agent              tear down   to the DB
                             the failure)                        the VM
```

The runner mirrors how real tickets flow through 2501 — same gateway, same agent, same orchestrator — but with a controlled environment and a scoring harness around it. See [Playbooks](/0.8/benchmark/playbooks) for the execution diagram in full.
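
As a rough mental model, the phase order above could be expressed as the following Python sketch. The `run_scenario` function and the `steps` mapping are hypothetical. The choice to run restore and report even when an earlier phase fails is an assumption made for illustration; the real runner may handle failures differently.

```python
from typing import Callable, Optional

# Phase names taken from the diagram above.
MAIN_PHASES = ["pre-flight", "provision", "prepare", "execute", "validate"]
CLEANUP_PHASES = ["restore", "report"]
PHASES = MAIN_PHASES + CLEANUP_PHASES


def run_scenario(
    steps: dict[str, Callable[[], None]],
) -> tuple[list[str], Optional[Exception]]:
    """Run phases in order; return the phases that completed and any error.

    Assumption: cleanup phases still run after a failure so the host is not
    left in the deliberately broken state introduced by the prepare phase.
    """
    ran: list[str] = []
    error: Optional[Exception] = None
    try:
        for name in MAIN_PHASES:
            steps[name]()
            ran.append(name)
    except Exception as exc:  # remember the failure, then fall through to cleanup
        error = exc
    for name in CLEANUP_PHASES:
        steps[name]()
        ran.append(name)
    return ran, error
```

Each entry in `steps` would wrap the real work for that phase, for example running `prepare.yml` or dispatching the ticket to the agent.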

## Where benchmarks run

| Mode                        | When to use                                                                      |
| --------------------------- | -------------------------------------------------------------------------------- |
| **`--mode host`** (default) | Pre-provisioned VMs you maintain. Simplest setup, most production-like.          |
| **`--mode lima`**           | macOS sandbox. The runner provisions a fresh Lima VM per run, destroys it after. |
| **`--mode incus`**          | Linux sandbox. Same approach as `--mode lima`, using Incus instead of Lima.      |

Both VM modes (`lima` and `incus`) let you run scenarios with `--parallel` for throughput, and guarantee that every run starts from an identical, known-good state. See [VM Sandbox](/0.8/benchmark/sandbox) for setup.
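
The table above boils down to a simple choice, shown here as a small Python sketch. `suggest_mode` is a hypothetical helper and not something the runner provides; it only encodes the guidance in the table (Lima on macOS, Incus on Linux, `host` otherwise).

```python
import platform
from typing import Optional


def suggest_mode(want_sandbox: bool, system: Optional[str] = None) -> str:
    """Suggest a value for --mode based on the table above.

    `system` defaults to platform.system(), which reports "Darwin" on macOS
    and "Linux" on Linux.
    """
    if not want_sandbox:
        return "host"   # pre-provisioned VMs, the default mode
    system = platform.system() if system is None else system
    if system == "Darwin":
        return "lima"   # macOS sandbox
    if system == "Linux":
        return "incus"  # Linux sandbox
    return "host"       # no sandbox mode listed for other platforms
```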

## Prerequisites

* **A running 2501 instance** reachable from the runner machine (the runner connects to the database via `DATABASE_URL` and optionally to a gateway or the engine API).
* **Ansible** installed and on PATH — playbooks drive prepare / validate / restore.
* **SSH access** from the runner machine to your sandbox hosts (or Lima/Incus installed for VM modes).
* **Scenarios** in a directory the runner can find (auto-detected, or set with `-p`).
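
Two of the prerequisites above can be checked locally before a first run. The sketch below is a hypothetical helper, not part of the 2501 CLI. It assumes the playbooks are executed through the standard `ansible-playbook` binary, and it only checks that `DATABASE_URL` is set, not that the database is reachable.

```python
import os
import shutil
from typing import Mapping, Optional


def missing_prerequisites(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return human-readable descriptions of prerequisites that look unmet."""
    env = os.environ if env is None else env
    problems: list[str] = []
    if not env.get("DATABASE_URL"):
        problems.append("DATABASE_URL is not set")
    # Assumption: Ansible is invoked via the ansible-playbook executable.
    if shutil.which("ansible-playbook", path=env.get("PATH")) is None:
        problems.append("ansible-playbook not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(f"missing: {problem}")
```

SSH access and scenario discovery depend on your hosts and directory layout, so they are left out of this check.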

## What to read next

<CardGroup cols={2}>
  <Card title="Quickstart" icon="play" href="/0.8/benchmark/quickstart">
    Run your first scenario in 5 minutes.
  </Card>

  <Card title="Authoring scenarios" icon="pen-ruler" href="/0.8/benchmark/scenario">
    Write `scenario.json`, hosts, agents, playbooks, validators.
  </Card>

  <Card title="Running" icon="terminal" href="/0.8/benchmark/start">
    The `2501 runner start / validate / flush` reference.
  </Card>

  <Card title="VM Sandbox" icon="cube" href="/0.8/benchmark/sandbox">
    Ephemeral VMs for fully reproducible runs.
  </Card>
</CardGroup>
