# Testing a Behavior

> Verify a new specialty, rule, or MCP before rolling it out to production

Whenever you change a specialty, add an operational rule, or attach a new MCP, you want to know that the change makes things better, not worse. There are three layers of testing, from quickest to most rigorous.

## 1. Individual task

You don't have to wait for a real ticket. Create a new specialty, attach it to an agent, and send a task directly:

> "Add 4GB of RAM on the machine SNDX-EUW3-DOCKER"

Read the commands the agent ran, how it handled unexpected problems, and how it reached a resolution. Compare the run against one from an agent using the previous specialty.

This check is quick and cheap, and it surfaces obvious regressions immediately.
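
If you copy the commands from both runs into plain lists, the comparison against the previous specialty can be made mechanical. The following is a minimal sketch using Python's standard `difflib`; the function name `diff_commands` and the sample command lists are hypothetical and are not taken from a real agent transcript or from any 2501 API.

```python
import difflib


def diff_commands(old_run, new_run):
    """Return unified-diff lines between two ordered command lists."""
    return list(
        difflib.unified_diff(
            old_run,
            new_run,
            fromfile="previous-specialty",
            tofile="new-specialty",
            lineterm="",  # the inputs carry no trailing newlines
        )
    )


# Hypothetical command lists, for illustration only.
previous = ["free -h", "systemctl status docker"]
candidate = ["free -h", "docker info", "systemctl status docker"]

for line in diff_commands(previous, candidate):
    print(line)
```

Lines prefixed with `+` are commands only the new specialty ran; lines prefixed with `-` are commands it dropped. An empty diff means both agents took the same path.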

## 2. Read-only / investigate mode

By default, agents run in remediate mode. Adding `@2501:investigate` to a task, or pinning a specialty to **investigate-only**, keeps the agent read-only: the secondary engine blocks any command that would alter the system.

Use this for a **plan-then-apply** flow:

1. Tag the task `@2501:investigate`.
2. Ask: "Craft me a plan of actions to resolve this issue."
3. Read the agent's plan, its inspections, and the constraints it cited from operational rules.
4. Tweak the prompt or rules if needed.
5. Re-run as a remediation task once confident.

This is the safest way to introduce a new behavior to a critical system.

## 3. Sandbox & Benchmarking

The most rigorous path: replicate a realistic ticket in a sandbox with Ansible playbooks, then run the agent against it dozens or hundreds of times.

[Benchmark](/0.8/benchmark/overview) evaluates two things independently:

| Score          | Question                                                                                |
| -------------- | --------------------------------------------------------------------------------------- |
| **Pass rate**  | Did the agent actually fix the problem? (ground-truth check against host state)         |
| **Compliance** | Did the agent follow your processes? (commands, summaries, operational-rule injections) |

A scenario passes only when both gates pass. An agent can fix a problem the wrong way; an agent can do everything right and still leave the system broken. The split catches both.
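
To make the two-gate logic concrete, here is a minimal sketch of how repeated runs could be tallied. It is not Benchmark's implementation or data model: `RunResult`, `summarize`, and the boolean treatment of compliance are assumptions made for illustration, and the sample runs are invented.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    fixed: bool      # ground-truth check against host state (assumed boolean)
    compliant: bool  # agent followed the expected processes (assumed boolean)


def summarize(runs):
    """Tally each gate separately, plus the share of runs passing both."""
    total = len(runs)
    if total == 0:
        raise ValueError("no runs to summarize")
    return {
        "pass_rate": sum(r.fixed for r in runs) / total,
        "compliance": sum(r.compliant for r in runs) / total,
        "both_gates": sum(r.fixed and r.compliant for r in runs) / total,
    }


# Invented sample: four runs of the same scenario.
runs = [
    RunResult(fixed=True, compliant=True),
    RunResult(fixed=True, compliant=False),   # fixed the problem the wrong way
    RunResult(fixed=False, compliant=True),   # did everything right, still broken
    RunResult(fixed=True, compliant=True),
]

summary = summarize(runs)
print(summary)
```

In this sample each gate passes on three of four runs, yet only half the runs pass both, which is the gap a single combined score would hide.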

The Benchmarks page in Command Center surfaces pass rate, compliance scores, and trends over time, so you can see whether your latest specialty edit improved results or caused a regression.

## When to use which

| Situation                                   | Use                                                  |
| ------------------------------------------- | ---------------------------------------------------- |
| Small specialty edit, trivial change        | Individual task                                      |
| New procedural operational rule             | Investigate mode + 1-2 individual tasks              |
| New specialty for a critical domain         | All three — task → investigate → benchmark           |
| Production rollout of a new ticket typology | Benchmark with multiple iterations before going live |
