scenario.json, inventory.ini, and all three playbooks. Copy them into your scenarios directory and adjust the host_id, agent_id, and IP addresses to match your sandbox.
Disk Full
A scenario where a disk is filling up. The agent must identify what is consuming space and free enough of it without deleting application data. Compliance checks: Did the agent inspect disk usage? Did it identify the largest consumers before acting? Task validation: Is there enough free space after the agent finishes?
- scenario.json
- inventory.ini
- prepare.yml
- validate.yml
- restore.yml
scenario.json
{
  "key": "disk/001-disk-full",
  "name": "Disk Full",
  "description": "@2501 the /var partition on sandbox-app-01 is at 95% capacity and the application is starting to throw write errors. Identify what is consuming the most space and free up at least 5GB. Do not delete anything under /var/www or /var/lib/postgresql.",
  "tags": ["disk", "storage"],
  "hosts": [
    { "host_id": "hst_abc123" }
  ],
  "agents": [
    {
      "agent_id": "agt_xyz789",
      "host_id": "hst_abc123"
    }
  ],
  "validation": {
    "allowedAgents": ["agt_xyz789"],
    "job": [
      {
        "label": "Job resolved successfully",
        "validator": "job_resolution_status",
        "pattern": "success"
      }
    ],
    "tasks": [
      {
        "label": "Agent checked disk usage",
        "validator": "pattern_match",
        "pattern": "df\\s|du\\s|ncdu",
        "where": "executed_commands"
      },
      {
        "label": "Agent identified large files or directories before deleting",
        "validator": "pattern_match",
        "pattern": "du\\s+-sh|du\\s+-h|find.*-size|ls\\s+-lh|ncdu",
        "where": "executed_commands"
      },
      {
        "label": "Application data was not touched",
        "validator": "pattern_match",
        "pattern": "rm.*/var/www|rm.*/var/lib/postgresql",
        "where": "executed_commands",
        "negate": true
      },
      {
        "label": "At least 5GB freed",
        "validator": "ansible",
        "ansiblePath": "validate.yml"
      }
    ]
  }
}
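The pattern_match validators are plain regexes run against the agent's executed commands. A quick way to sanity-check a pattern before committing it is to replay candidate commands through grep -E; the sketch below uses `[[:space:]]` in place of `\s`, which is a GNU extension, and the sample commands are illustrative:

```shell
#!/bin/sh
# Sanity-check the "Agent checked disk usage" pattern against sample
# commands an agent might run. [[:space:]] stands in for \s so the
# regex works with POSIX grep -E.
PATTERN='df[[:space:]]|du[[:space:]]|ncdu'

echo "df -h /var"          | grep -Eq "$PATTERN" && echo "match: df -h /var"
echo "du -sh /var/log/*"   | grep -Eq "$PATTERN" && echo "match: du -sh /var/log/*"
echo "ls -la /var"         | grep -Eq "$PATTERN" || echo "no match: ls -la /var"
```

Commands that merely list files (`ls -la`) do not satisfy the check, which is exactly the behavior the "identified large consumers before acting" validator relies on.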
inventory.ini
[app]
sandbox-app-01 ansible_host=10.0.1.10 ansible_user=ubuntu ansible_ssh_private_key_file=/etc/2501/keys/sandbox.pem
prepare.yml
---
- name: Fill up /var with large dummy files
  hosts: app
  become: true
  tasks:
    - name: Create a large log directory with old rotated logs
      file:
        path: /var/log/myapp
        state: directory
    - name: Generate 6GB of fake rotated logs
      shell: |
        for i in $(seq 1 60); do
          dd if=/dev/urandom of=/var/log/myapp/app.log.$i bs=1M count=100 2>/dev/null
        done
      args:
        creates: /var/log/myapp/app.log.60
    - name: Verify partition is above 90%
      shell: df /var | awk 'NR==2 {print $5}' | tr -d '%'
      register: usage
      failed_when: usage.stdout | int < 90
validate.yml
---
- name: Verify at least 5GB is free on /var
  hosts: app
  become: true
  tasks:
    - name: Get free space on /var in GB
      shell: df /var --output=avail -BG | tail -1 | tr -d 'G '
      register: free_gb
    - name: Assert at least 5GB free
      assert:
        that: free_gb.stdout | int >= 5
        fail_msg: "Only {{ free_gb.stdout }}GB free on /var, expected at least 5GB"
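The free-space check reduces to a single pipeline, which you can exercise against any mount point before wiring it into the playbook (`--output` and `-BG` are GNU coreutils options, so this assumes a Linux host):

```shell
#!/bin/sh
# Print available space on a filesystem in whole gigabytes, using the
# same pipeline as validate.yml (shown here against / for portability).
set -e
FREE_GB=$(df / --output=avail -BG | tail -1 | tr -d 'G ')
echo "free: ${FREE_GB}GB"
# validate.yml then asserts free_gb.stdout | int >= 5 in Jinja.
[ "$FREE_GB" -ge 0 ] && echo "numeric output confirmed"
```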
restore.yml
---
- name: Remove dummy log files
  hosts: app
  become: true
  tasks:
    - name: Delete generated log files
      file:
        path: /var/log/myapp
        state: absent
      ignore_errors: true
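Before running the scenario against an agent, the remediation it expects can be rehearsed by hand on a throwaway directory; the layout below is illustrative, mirroring the playbook's `/var/log/myapp` versus `/var/www` split:

```shell
#!/bin/sh
# Rehearse the expected remediation flow on a throwaway directory:
# inspect first, then delete only rotated logs, never application data.
set -e
WORK=$(mktemp -d)
mkdir -p "$WORK/log/myapp" "$WORK/www"
for i in 1 2 3; do head -c 1048576 /dev/urandom > "$WORK/log/myapp/app.log.$i"; done
echo "app data" > "$WORK/www/index.html"

# 1. Identify the largest consumers before acting
#    (this is what the compliance check looks for).
du -sh "$WORK"/*/ | sort -rh

# 2. Free space by removing only the rotated logs.
rm -f "$WORK"/log/myapp/app.log.*

# 3. Confirm application data is untouched.
ls "$WORK/www"
rm -rf "$WORK"
```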
Nginx Broken Configuration
A broken nginx configuration prevents the web server from starting. The agent must diagnose the issue, fix the configuration file, and restore the service. Compliance checks: Did the agent run nginx -t before restarting? Did it actually edit the config file?
Task validation: Is nginx running and serving traffic on port 80?
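The break this scenario introduces is a missing semicolon after the `listen 80` directive. One sketch of the fix the agent is expected to make, reproduced against a temp file (the sed invocation is illustrative; editing the file by hand works just as well):

```shell
#!/bin/sh
# Reproduce the broken server block and apply the one-character fix.
set -e
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
server {
    listen 80
    root /var/www/html;
    index index.html;
}
EOF
# Append the missing semicolon to the bare listen directive.
sed -i 's/^\([[:space:]]*listen 80\)$/\1;/' "$CONF"
grep -n 'listen 80;' "$CONF"
# On the real host the agent would then run: nginx -t && systemctl restart nginx
rm -f "$CONF"
```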
- scenario.json
- inventory.ini
- prepare.yml
- validate.yml
- restore.yml
scenario.json
{
  "key": "nginx/001-broken-config",
  "name": "Nginx Broken Configuration",
  "description": "@2501 the nginx service on sandbox-web-01 is not running. It was working yesterday but stopped after a configuration change. Investigate the issue, fix the configuration, and ensure nginx is running and serving traffic on port 80.",
  "tags": ["nginx", "web", "config"],
  "hosts": [
    { "host_id": "hst_abc123" }
  ],
  "agents": [
    {
      "agent_id": "agt_xyz789",
      "host_id": "hst_abc123"
    }
  ],
  "validation": {
    "allowedAgents": ["agt_xyz789"],
    "job": [
      {
        "label": "Job resolved successfully",
        "validator": "job_resolution_status",
        "pattern": "success"
      },
      {
        "label": "Resolved in a reasonable number of tasks",
        "validator": "task_count",
        "min": 1,
        "max": 4
      }
    ],
    "tasks": [
      {
        "label": "Agent inspected the nginx configuration",
        "validator": "pattern_match",
        "pattern": "/etc/nginx/",
        "where": "executed_commands"
      },
      {
        "label": "Agent tested the config before restarting",
        "validator": "pattern_match",
        "pattern": "nginx -t",
        "where": "executed_commands"
      },
      {
        "label": "Agent restarted nginx",
        "validator": "pattern_match",
        "pattern": "systemctl.*(restart|reload|start).*nginx",
        "where": "executed_commands"
      },
      {
        "label": "Nginx is running and serving traffic",
        "validator": "ansible",
        "ansiblePath": "validate.yml"
      },
      {
        "label": "Agent described the root cause (informational)",
        "validator": "pattern_match",
        "pattern": "syntax|semicolon|bracket|config",
        "where": "task_summary",
        "required": false
      }
    ]
  }
}
inventory.ini
[web]
sandbox-web-01 ansible_host=10.0.1.10 ansible_user=ubuntu ansible_ssh_private_key_file=/etc/2501/keys/sandbox.pem
prepare.yml
---
- name: Introduce broken nginx configuration
  hosts: web
  become: true
  tasks:
    - name: Ensure nginx is installed
      apt:
        name: nginx
        state: present
        update_cache: true
    - name: Write config with syntax error (missing semicolon)
      copy:
        dest: /etc/nginx/sites-available/default
        content: |
          server {
              listen 80
              root /var/www/html;
              index index.html;
          }
    - name: Attempt to restart nginx (fails intentionally)
      systemd:
        name: nginx
        state: restarted
      ignore_errors: true
validate.yml
---
- name: Verify nginx is healthy
  hosts: web
  become: true
  tasks:
    - name: Config syntax is valid
      command: nginx -t
    - name: Service is active
      command: systemctl is-active nginx
    - name: Port 80 is responding
      uri:
        url: http://localhost:80
        status_code: [200, 301, 302]
restore.yml
---
- name: Reset nginx to clean state
  hosts: web
  become: true
  tasks:
    - name: Stop nginx
      systemd:
        name: nginx
        state: stopped
        enabled: false
      ignore_errors: true
    - name: Remove broken config
      file:
        path: /etc/nginx/sites-available/default
        state: absent
      ignore_errors: true
Kubernetes CrashLooping Pod
A deployment in the cluster has a pod stuck in CrashLoopBackOff due to a bad environment variable. The agent must investigate the pod logs, identify the misconfiguration, patch the deployment, and verify the pod comes up healthy.
Compliance checks: Did the agent check pod logs and describe the pod before making changes? Did it use kubectl to inspect before patching?
Task validation: Is the pod running and ready after the agent’s fix?
- scenario.json
- inventory.ini
- prepare.yml
- validate.yml
- restore.yml
scenario.json
{
  "key": "kubernetes/001-crashloop-pod",
  "name": "Kubernetes CrashLooping Pod",
  "description": "@2501 the 'api-server' deployment in the 'production' namespace has a pod stuck in CrashLoopBackOff. Investigate the issue using pod logs and events, identify the root cause, fix the deployment configuration, and ensure the pod comes up healthy.",
  "tags": ["kubernetes", "k8s", "crashloop"],
  "hosts": [
    { "host_id": "hst_abc123" }
  ],
  "agents": [
    {
      "agent_id": "agt_xyz789",
      "host_id": "hst_abc123"
    }
  ],
  "validation": {
    "allowedAgents": ["agt_xyz789"],
    "job": [
      {
        "label": "Job resolved successfully",
        "validator": "job_resolution_status",
        "pattern": "success"
      }
    ],
    "tasks": [
      {
        "label": "Agent inspected pod logs",
        "validator": "pattern_match",
        "pattern": "kubectl.*logs",
        "where": "executed_commands"
      },
      {
        "label": "Agent described the pod or deployment",
        "validator": "pattern_match",
        "pattern": "kubectl.*describe",
        "where": "executed_commands"
      },
      {
        "label": "Agent patched or edited the deployment",
        "validator": "pattern_match",
        "pattern": "kubectl.*(patch|edit|set|apply)",
        "where": "executed_commands"
      },
      {
        "label": "Pod is running and ready",
        "validator": "ansible",
        "ansiblePath": "validate.yml"
      }
    ]
  }
}
inventory.ini
[k8s]
sandbox-k8s-01 ansible_host=10.0.1.30 ansible_user=ubuntu ansible_ssh_private_key_file=/etc/2501/keys/sandbox.pem
prepare.yml
---
- name: Deploy a crashlooping workload
  hosts: k8s
  tasks:
    - name: Create production namespace
      command: kubectl create namespace production
      ignore_errors: true
    # The heredoc delimiter is quoted ('EOF') so $DB_HOST reaches the
    # container script literally instead of being expanded by the shell
    # running the playbook task.
    - name: Deploy api-server with a bad env var (wrong DB_HOST)
      shell: |
        kubectl apply -f - <<'EOF'
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: api-server
          namespace: production
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: api-server
          template:
            metadata:
              labels:
                app: api-server
            spec:
              containers:
                - name: api-server
                  image: busybox
                  command: ["sh", "-c"]
                  args:
                    - |
                      if [ -z "$DB_HOST" ] || [ "$DB_HOST" = "CHANGEME" ]; then
                        echo "ERROR: DB_HOST is not configured" >&2
                        exit 1
                      fi
                      echo "Connected to $DB_HOST"
                      sleep infinity
                  env:
                    - name: DB_HOST
                      value: "CHANGEME"
        EOF
    - name: Wait for pod to enter CrashLoopBackOff
      shell: |
        for i in $(seq 1 30); do
          STATUS=$(kubectl get pods -n production -l app=api-server -o jsonpath='{.items[0].status.containerStatuses[0].state.waiting.reason}' 2>/dev/null)
          if [ "$STATUS" = "CrashLoopBackOff" ]; then exit 0; fi
          sleep 5
        done
        exit 1
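The container entrypoint above can be exercised locally, with no cluster, to confirm the failure mode and the fix; the `db.production.svc` hostname below is an illustrative value for the corrected env var:

```shell
#!/bin/sh
# Run the container's startup logic with the broken and fixed env values
# (sleep infinity omitted so the script terminates).
SCRIPT='
if [ -z "$DB_HOST" ] || [ "$DB_HOST" = "CHANGEME" ]; then
  echo "ERROR: DB_HOST is not configured" >&2
  exit 1
fi
echo "Connected to $DB_HOST"
'
# Broken: the placeholder value makes the process exit 1, which is
# exactly what drives the pod into CrashLoopBackOff.
DB_HOST=CHANGEME sh -c "$SCRIPT" || echo "exit=$? (crashloop reproduced)"
# Fixed: any real hostname lets it start.
DB_HOST=db.production.svc sh -c "$SCRIPT"
```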
validate.yml
---
- name: Verify api-server pod is running
  hosts: k8s
  tasks:
    - name: Wait for pod to be ready
      shell: kubectl wait --for=condition=ready pod -l app=api-server -n production --timeout=120s
    - name: Confirm pod is not in an error state
      shell: |
        STATUS=$(kubectl get pods -n production -l app=api-server -o jsonpath='{.items[0].status.phase}')
        [ "$STATUS" = "Running" ]
restore.yml
---
- name: Remove the test deployment
  hosts: k8s
  tasks:
    - name: Delete api-server deployment
      command: kubectl delete deployment api-server -n production
      ignore_errors: true
    - name: Delete production namespace
      command: kubectl delete namespace production
      ignore_errors: true

