1. Individual task
You don’t have to wait for a real ticket. Create a new specialty, attach it to an agent, and send a task directly:

“Add 4GB of RAM on the machine SNDX-EUW3-DOCKER”

Read the commands the agent ran, how it handled unexpected problems, and how it reached a resolution. Compare the run against an agent using the previous specialty. This is quick and cheap, and it surfaces obvious regressions immediately.
2. Read-only / investigate mode
By default, agents run in remediate mode. Adding @2501:investigate to a task, or pinning a specialty to investigate-only, keeps the agent read-only: the secondary engine blocks any command that would alter the system.
Use this for a plan-then-apply flow:
- Tag the task @2501:investigate.
- Ask: “Craft me a plan of actions to resolve this issue.”
- Read the agent’s plan, its inspections, and the constraints it cited from operational rules.
- Tweak the prompt or rules if needed.
- Re-run as a remediation task once confident.
3. Sandbox & Benchmarking
The most rigorous path: replicate a realistic ticket in a sandbox with Ansible playbooks, then run the agent against it dozens or hundreds of times. Benchmark evaluates two things independently:

| Score | Question |
|---|---|
| Pass rate | Did the agent actually fix the problem? (ground-truth check against host state) |
| Compliance | Did the agent follow your processes? (commands, summaries, operational-rule injections) |
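
To make the independence of the two scores concrete, here is a minimal sketch of how they could be aggregated over repeated runs. The `RunResult` record and the `score` function are hypothetical names invented for illustration, not part of any 2501 API, and the sample data is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One benchmark run (hypothetical record shape, for illustration only)."""
    fixed: bool      # ground-truth check against host state passed
    compliant: bool  # agent followed the expected process


def score(runs: list[RunResult]) -> dict[str, float]:
    """Return pass rate and compliance as independent fractions in [0, 1]."""
    if not runs:
        raise ValueError("need at least one run to score")
    total = len(runs)
    return {
        "pass_rate": sum(r.fixed for r in runs) / total,
        "compliance": sum(r.compliant for r in runs) / total,
    }


# Made-up sample: four runs of the same sandboxed ticket.
runs = [
    RunResult(fixed=True, compliant=True),
    RunResult(fixed=True, compliant=False),   # fixed it, but skipped a process step
    RunResult(fixed=False, compliant=True),   # followed the process, fix did not hold
    RunResult(fixed=True, compliant=False),
]
scores = score(runs)
print(scores)  # {'pass_rate': 0.75, 'compliance': 0.5}
```

Keeping the two fractions separate is what lets a run that fixes the host but skips a required step show up as a compliance problem rather than being hidden behind a good pass rate.
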
When to use which
| Situation | Use |
|---|---|
| Small specialty edit, trivial change | Individual task |
| New procedural operational rule | Investigate mode + 1-2 individual tasks |
| New specialty for a critical domain | All three — task → investigate → benchmark |
| Production rollout of a new ticket typology | Benchmark with multiple iterations before going live |

