# Benchmarking for Risk

> Validate new use cases in a sandbox before exposing them to production

When a new ticket typology arrives, you don't have to test it in production. Replicate it with Ansible playbooks on a sandbox host and evaluate the agent there.

## What gets measured

Each scenario run produces two **independent** scores:

| Score          | Question                                |
| -------------- | --------------------------------------- |
| **Pass rate**  | Did the agent actually fix the problem? |
| **Compliance** | Did the agent follow your process?      |

A scenario passes only when **both** gates pass. An agent can fix a problem the wrong way, or follow every rule and still leave the system broken — the split catches both.
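
The two-gate rule can be sketched in a few lines of Python. The `ScenarioResult` class and its field names are hypothetical, chosen for illustration only; they are not the benchmark runner's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    """Hypothetical record of one scenario run (not the runner's real schema)."""

    name: str
    outcome_ok: bool     # pass-rate gate: is the problem actually fixed?
    compliance_ok: bool  # compliance gate: was the expected process followed?

    @property
    def passed(self) -> bool:
        # Both gates must pass; neither one can compensate for the other.
        return self.outcome_ok and self.compliance_ok


results = [
    ScenarioResult("fixed-the-wrong-way", outcome_ok=True, compliance_ok=False),
    ScenarioResult("by-the-book-but-broken", outcome_ok=False, compliance_ok=True),
    ScenarioResult("clean-fix", outcome_ok=True, compliance_ok=True),
]

# Only the scenario that clears both gates counts as passing.
passing = [r.name for r in results if r.passed]
print(passing)
```

Keeping the two booleans separate, rather than collapsing them into one score, is what lets a report distinguish "fixed but non-compliant" from "compliant but still broken".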

### Pass rate

Verifies the end result, regardless of how the agent got there.

* **Ansible-based ground-truth check** — after the agent finishes, an Ansible playbook inspects the host (service status, file contents, port responsiveness). Most reliable.
* **Output-based check** — if there's nothing to verify on the host (e.g. for pure investigation tickets), measure the quality of the report.
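
In the product, the ground-truth check is an Ansible playbook. The Python below is only an illustration of the underlying idea, which is to inspect the host's end state directly instead of trusting the agent's own account. The helper names `port_responds` and `file_contains` are assumptions made for this sketch.

```python
import socket
from pathlib import Path


def port_responds(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def file_contains(path: str, expected: str) -> bool:
    """Return True if the file is readable and contains the expected text."""
    try:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
    except OSError:
        # A missing or unreadable file counts as a failed check.
        return False
    return expected in text
```

A scenario for a "web server down" ticket might then require both `port_responds("127.0.0.1", 80)` and `file_contains("/etc/nginx/nginx.conf", "worker_processes")` to hold after the agent finishes; the host, port, and path here are placeholders.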

### Compliance

Verifies the agent did it the way you expected.

* Which actions were taken
* Which tools were used (and which weren't)
* Specific words in the task summary
* Which operational rules were injected
* How many tasks the job needed
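
A minimal sketch of how some of these checks could be expressed, assuming a hypothetical `RunRecord` trace and `CompliancePolicy` (neither is the platform's real data model). It covers tool usage, summary wording, and task count; checks on actions and injected rules would follow the same pattern.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical trace of one agent job; field names are illustrative."""

    tools_used: set[str]
    summary: str
    task_count: int


@dataclass
class CompliancePolicy:
    """Hypothetical policy describing the expected process."""

    required_tools: set[str] = field(default_factory=set)
    forbidden_tools: set[str] = field(default_factory=set)
    required_words: set[str] = field(default_factory=set)
    max_tasks: Optional[int] = None


def compliance_violations(run: RunRecord, policy: CompliancePolicy) -> list[str]:
    """Return a list of human-readable violations; an empty list means compliant."""
    violations = []
    for tool in sorted(policy.required_tools - run.tools_used):
        violations.append(f"required tool not used: {tool}")
    for tool in sorted(policy.forbidden_tools & run.tools_used):
        violations.append(f"forbidden tool used: {tool}")
    summary = run.summary.lower()  # match required words case-insensitively
    for word in sorted(policy.required_words):
        if word.lower() not in summary:
            violations.append(f"summary missing word: {word}")
    if policy.max_tasks is not None and run.task_count > policy.max_tasks:
        violations.append(f"too many tasks: {run.task_count} > {policy.max_tasks}")
    return violations


run = RunRecord(
    tools_used={"systemctl", "rm"},
    summary="Restarted nginx after fixing the config.",
    task_count=4,
)
policy = CompliancePolicy(
    required_tools={"systemctl"},
    forbidden_tools={"rm"},
    required_words={"root cause"},
    max_tasks=3,
)
violations = compliance_violations(run, policy)
print(violations)
```

Returning the full list of violations, instead of a single boolean, gives an auditor the reason a run failed compliance even when the fix itself worked.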

Compliance lets auditors confirm agents follow company practice — useful for regulated industries.

## Automation

The benchmark runner can execute scenarios on a regular schedule. Results land in **Command Center → Benchmarks** with pass rate, compliance, and trends over time, so you can spot regressions early.
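
Spotting a regression boils down to comparing two runs of the same suite. The sketch below assumes each run is available as a plain mapping from scenario name to pass/fail; that shape, the function name, and the scenario names are assumptions for illustration, not an export format the runner is known to provide.

```python
def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Return scenarios that passed in the previous run and fail in the current one."""
    return sorted(
        name
        for name, ok in current.items()
        if not ok and previous.get(name) is True
    )


# Hypothetical results from two consecutive scheduled runs.
previous_run = {"disk-full": True, "nginx-down": True, "cert-expired": False}
current_run = {
    "disk-full": True,
    "nginx-down": False,    # passed before, fails now: a regression
    "cert-expired": False,  # was already failing: not a regression
    "new-scenario": False,  # no history yet: not a regression
}

regressed = find_regressions(previous_run, current_run)
print(regressed)
```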

Recommended cadence: a **weekly overnight run** of your full scenario suite, plus **per-PR runs** of the scenarios that touch changed specialties or rules.
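
Selecting the per-PR subset can be sketched as a set intersection, assuming you maintain a mapping from each scenario to the specialties and rules it exercises. The mapping, the `specialty:` and `rule:` prefixes, and all names below are hypothetical.

```python
def scenarios_for_change(
    scenario_deps: dict[str, set[str]], changed: set[str]
) -> list[str]:
    """Return scenarios that depend on at least one changed specialty or rule."""
    return sorted(
        name for name, deps in scenario_deps.items() if deps & changed
    )


# Hypothetical dependency map: scenario -> specialties and rules it exercises.
scenario_deps = {
    "nginx-down": {"specialty:web", "rule:restart-policy"},
    "disk-full": {"specialty:storage"},
    "cert-expired": {"specialty:web", "rule:change-window"},
}

# A PR that edits one rule only needs the scenarios that exercise it.
to_run = scenarios_for_change(scenario_deps, {"rule:restart-policy"})
print(to_run)
```

The full suite still runs on the regular schedule, so a scenario missing from the dependency map is caught there, only later.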

## What benchmarking is good for

| Use case                                                 | Why                                                                                  |
| -------------------------------------------------------- | ------------------------------------------------------------------------------------ |
| Validating a new ticket typology before exposing to prod | Replicate the failure mode, see how the agent handles it                             |
| Comparing two specialties side-by-side                   | Same scenario, different specialty, see which passes faster and more cleanly         |
| Catching regressions after a rule edit                   | Trend lines surface failures the moment a previously-passing scenario starts failing |
| Comparing LLMs (Sonnet vs Opus vs your own)              | Same scenario across multiple `--main-engine` overrides                              |

See [Benchmark](/0.8/benchmark/overview) for the runner CLI, scenario format, and validators.