Testing a Behavior

Whenever you change a specialty, add an operational rule, or attach a new MCP, you want to know that the change makes things better, not worse. There are three layers of testing, from quickest to most rigorous.

1. Individual task

You don’t have to wait for a real ticket. Create a new specialty, attach it to an agent, and send a task directly:

“Add 4GB of RAM on the machine SNDX-EUW3-DOCKER”

Read the commands the agent ran, how it handled unexpected problems, and how it reached a resolution. Compare the result against an agent using the previous specialty. This is quick and cheap, and it surfaces obvious regressions immediately.
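One way to make that comparison concrete is to diff the two command transcripts. The sketch below is only an illustration: it assumes you have copied the commands from each run into plain lists by hand, and the function name `diff_commands` and the sample commands are hypothetical, not part of the product.

```python
import difflib


def diff_commands(previous, candidate):
    """Return a unified diff between two command transcripts (lists of strings)."""
    return list(
        difflib.unified_diff(
            previous,
            candidate,
            fromfile="previous-specialty",
            tofile="new-specialty",
            lineterm="",
        )
    )


# Hypothetical transcripts, made up for illustration only.
previous_run = ["free -h", "systemctl restart docker"]
candidate_run = ["free -h", "docker ps", "systemctl restart docker"]

for line in diff_commands(previous_run, candidate_run):
    print(line)
```

Lines prefixed with `+` are commands only the new specialty ran, and lines prefixed with `-` are commands it dropped. An empty diff means both agents took the same path.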

2. Read-only / investigate mode

By default, agents run in remediate mode. Adding @2501:investigate to a task, or pinning a specialty to investigate-only, keeps the agent read-only: the secondary engine blocks any command that would alter the system. Use this for a plan-then-apply flow:

Tag the task @2501:investigate.
Ask: “Craft me a plan of actions to resolve this issue.”
Read the agent’s plan, its inspections, the constraints it cited from operational rules.
Tweak the prompt or rules if needed.
Re-run as a remediation task once confident.

This is the safest way to introduce a new behavior to a critical system.

3. Sandbox & Benchmarking

The most rigorous path: replicate a realistic ticket in a sandbox with Ansible playbooks, then run the agent against it dozens or hundreds of times. Benchmark evaluates two things independently:

Score	Question
Pass rate	Did the agent actually fix the problem? (ground-truth check against host state)
Compliance	Did the agent follow your processes? (commands, summaries, operational-rule injections)

A scenario passes only when both gates pass. An agent can fix a problem the wrong way; an agent can do everything right and still leave the system broken. The split catches both. The Benchmarks page in Command Center surfaces pass rate, compliance scores, and trends over time — so you can see whether your latest specialty edit improved things or regressed.
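The two-gate logic can be sketched in a few lines. This is a minimal illustration, not the product's scoring code: it assumes each run reduces to two booleans, and the names `RunResult`, `summarize`, and `scenario_pass_rate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    fixed: bool      # assumed: result of the ground-truth check against host state
    compliant: bool  # assumed: whether the run followed your processes


def summarize(runs):
    """Aggregate per-run gates into the two independent scores plus the combined one."""
    total = len(runs)
    if total == 0:
        raise ValueError("no runs to summarize")
    return {
        "pass_rate": sum(r.fixed for r in runs) / total,
        "compliance": sum(r.compliant for r in runs) / total,
        # A run counts as a full pass only when both gates pass.
        "scenario_pass_rate": sum(r.fixed and r.compliant for r in runs) / total,
    }


# Made-up results for four iterations of one scenario.
runs = [
    RunResult(fixed=True, compliant=True),
    RunResult(fixed=True, compliant=False),   # fixed the problem the wrong way
    RunResult(fixed=False, compliant=True),   # followed process, system still broken
    RunResult(fixed=True, compliant=True),
]
print(summarize(runs))
```

Keeping the two scores separate is what lets you tell the second and third runs apart. A single combined number would hide which gate failed.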

When to use which

Situation	Use
Small specialty edit, trivial change	Individual task
New procedural operational rule	Investigate mode + 1-2 individual tasks
New specialty for a critical domain	All three — task → investigate → benchmark
Production rollout of a new ticket typology	Benchmark with multiple iterations before going live
