Whenever you change a specialty, add an operational rule, or attach a new MCP, you want to know that the change makes the agent better, not worse. There are three layers of validation, from quickest to most rigorous.

1. Individual task

You don’t have to wait for a real ticket. Create a new specialty, attach it to an agent, and send a task directly:
“Add 4GB of RAM on the machine SNDX-EUW3-DOCKER”
Read the commands the agent ran, how it handled unexpected problems, and how it reached a resolution. Compare the run against an agent using the previous specialty. This check is quick and cheap, and it surfaces obvious regressions immediately.
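One way to make that comparison concrete is to diff the command transcripts of the two runs. The sketch below assumes you have copied each agent's commands into a plain list of strings. The `diff_commands` helper and both transcripts are hypothetical illustrations, not part of the product.

```python
import difflib


def diff_commands(previous, candidate):
    """Return a unified diff between two command transcripts (lists of strings)."""
    return list(
        difflib.unified_diff(
            previous,
            candidate,
            fromfile="previous-specialty",
            tofile="new-specialty",
            lineterm="",
        )
    )


# Hypothetical transcripts, copied by hand from two runs of the same task.
previous_run = ["free -h", "systemctl status docker"]
candidate_run = ["free -h", "cat /proc/meminfo", "systemctl status docker"]

for line in diff_commands(previous_run, candidate_run):
    print(line)
```

Lines prefixed with `+` are commands only the new specialty ran, and lines prefixed with `-` are commands it dropped. An empty diff means both agents took the same steps.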

2. Read-only / investigate mode

By default, agents run in remediate mode. Adding @2501:investigate to a task, or pinning a specialty to investigate-only, keeps the agent read-only: the secondary engine blocks any command that would alter the system. Use this for a plan-then-apply flow:
  1. Tag the task @2501:investigate.
  2. Ask: “Craft me a plan of actions to resolve this issue.”
  3. Read the agent’s plan, its inspections, and the constraints it cited from operational rules.
  4. Tweak the prompt or rules if needed.
  5. Re-run as a remediation task once confident.
This is the safest way to introduce a new behavior to a critical system.
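If you script task submission, the two phases of this flow can be kept apart with a pair of small helpers. This is a minimal sketch: only the `@2501:investigate` tag comes from the text above. The helper names are hypothetical, and placing the tag at the start of the task is an assumption.

```python
INVESTIGATE_TAG = "@2501:investigate"  # tag quoted in the steps above


def as_investigation(task: str) -> str:
    """Return the task tagged for read-only mode.

    Assumption: the tag is placed at the start of the task text.
    """
    if INVESTIGATE_TAG in task:
        return task
    return f"{INVESTIGATE_TAG} {task}"


def as_remediation(task: str) -> str:
    """Return the task without the tag, so it runs in the default remediate mode.

    Whitespace is collapsed, so this is only suitable for single-line tasks.
    """
    return " ".join(task.replace(INVESTIGATE_TAG, "").split())


plan_task = as_investigation("Craft me a plan of actions to resolve this issue.")
print(plan_task)
print(as_remediation(plan_task))
```

Keeping the tag in one constant reduces the chance of a typo silently sending a task in remediate mode.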

3. Sandbox & Benchmarking

The most rigorous path: replicate a realistic ticket in a sandbox with Ansible playbooks, then run the agent against it dozens or hundreds of times. Benchmark evaluates two things independently:
| Score | Question |
| --- | --- |
| Pass rate | Did the agent actually fix the problem? (ground-truth check against host state) |
| Compliance | Did the agent follow your processes? (commands, summaries, operational-rule injections) |
A scenario passes only when both gates pass. An agent can fix a problem the wrong way; an agent can do everything right and still leave the system broken. The split catches both. The Benchmarks page in Command Center surfaces pass rate, compliance scores, and trends over time — so you can see whether your latest specialty edit improved things or regressed.
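The two-gate rule can be sketched in a few lines. The example below assumes each benchmark run is reduced to two booleans, which is a simplification, since the product may report compliance as a graded score. `RunResult` and `summarize` are hypothetical names, and the sample runs are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    fixed: bool      # ground-truth check against host state
    compliant: bool  # processes followed (commands, summaries, rule injections)

    @property
    def passed(self) -> bool:
        # A scenario passes only when both gates pass.
        return self.fixed and self.compliant


def summarize(runs):
    """Aggregate each gate independently, plus the combined result."""
    total = len(runs)
    if total == 0:
        raise ValueError("no runs to summarize")
    return {
        "pass_rate": sum(r.fixed for r in runs) / total,
        "compliance": sum(r.compliant for r in runs) / total,
        "both_gates": sum(r.passed for r in runs) / total,
    }


# Invented sample: four runs of the same scenario.
runs = [
    RunResult(fixed=True, compliant=True),
    RunResult(fixed=True, compliant=False),   # fixed the problem the wrong way
    RunResult(fixed=False, compliant=True),   # did everything right, system still broken
    RunResult(fixed=True, compliant=True),
]
print(summarize(runs))
```

In this sample each gate alone reaches 0.75, yet only half of the runs clear both, which is the gap the split is meant to expose.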

When to use which

| Situation | Use |
| --- | --- |
| Small specialty edit, trivial change | Individual task |
| New procedural operational rule | Investigate mode + 1-2 individual tasks |
| New specialty for a critical domain | All three: task → investigate → benchmark |
| Production rollout of a new ticket typology | Benchmark with multiple iterations before going live |