Nelson Home — Runbooks
Procedural checklists triggered by events or schedules. Agents: consult this file before performing actions that match an event runbook.
Written from real incidents — see .ops/archive/retrospectives/ for context.
Event Runbooks
New Docker Service Added
Derived from: Vaultwarden, Uptime Kuma, Grafana deployments on nelson-manager (Feb 18-19)
- Compose file: Create `docker-compose/<service>/docker-compose.yml` with pinned version tag (never `latest` for stateful services)
- Credentials: All passwords/tokens go in `ansible/group_vars/all/secrets.yml` (Vault-encrypted). Reference via Jinja2 variables in compose. Never commit plaintext credentials.
- containers.yml: Add the service to `ansible/group_vars/all/containers.yml` if managed by `deploy_stack.yml`
- DNS: Add entry to `dns_rewrites` list in `ansible/group_vars/all/common.yml`. Run `configure_adguard_dns.yml` via Semaphore.
- NPM: Add entry to `npm_proxy_hosts` list in `common.yml`. Run `configure_npm_hosts.yml` via Semaphore.
- Homepage: Add widget/service entry to `docker-compose/homepage/config/services.yaml`
- Monitoring: Add to Uptime Kuma (when deployed)
- CHANGELOG: Log with `**FEAT**` tag
- Verify: Confirm `<service>.nelson.home` resolves and proxies correctly
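The pinned-tag rule in the first step can be checked mechanically before a compose file is committed. The following is a minimal sketch, not part of the existing tooling: the helper names `image_tag` and `is_pinned` are hypothetical, and the image references in the usage note are illustrative only.

```python
from typing import Optional


def image_tag(image: str) -> Optional[str]:
    """Return the tag of an image reference, or None if it has no tag."""
    # Drop a digest suffix such as "@sha256:..." before looking for the tag.
    name = image.split("@", 1)[0]
    # Only inspect the last path segment: a registry host may contain ":" (port).
    last = name.rsplit("/", 1)[-1]
    return last.split(":", 1)[1] if ":" in last else None


def is_pinned(image: str) -> bool:
    """True if the image is pinned by digest or by an explicit tag other than 'latest'."""
    if "@sha256:" in image:
        return True
    tag = image_tag(image)
    return tag is not None and tag != "latest"
```

For example, `is_pinned("vaultwarden/server:1.30.1")` is true, while `is_pinned("vaultwarden/server:latest")` and `is_pinned("grafana/grafana")` are both false, since an untagged image defaults to `latest`.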
New Node Provisioned
Derived from: nelson-identity provisioning (Feb 18), nelson-manager provisioning (Feb 19)
- Proxmox: Create VM/LXC via Proxmox UI or API. For Docker workloads in LXC: set Privileged, add `lxc.apparmor.profile: unconfined` and `keyctl=1` feature.
- Inventory: Add host to `ansible/inventory/hosts.ini` with `ansible_host`, `ansible_user=btnelson`, and `tailscale_ip` if applicable. Create inventory group if needed.
- SSH keys: Copy operator SSH key (`ssh-copy-id btnelson@<new-ip>`)
- Tailscale: Run `deploy_tailscale.yml` via Semaphore targeting the new node
- Knowledge link: Run `knowledge_link.yml` to clone the repo onto the new node
- Semaphore: Create/update templates that target the new host group
- DNS: Add `nelson-<role>.nelson.home` to DNS rewrites in `common.yml`. Run `configure_adguard_dns.yml`.
- Backups: Add backup jobs if the node holds state (see `backup_vaultwarden.yml` / `backup_unifi.yml` patterns)
- PROTOCOL.md: Update Architecture section and Inventory Groups table
- ROADMAP.md: Update resource budget table
- CHANGELOG: Log with `**SUCCESS**` tag
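As an illustration of the Inventory step, a host entry in Ansible's INI inventory format is the host name followed by space-separated `key=value` host variables. The sketch below builds such a line; the function name `inventory_line` is hypothetical, and the host name and addresses in the usage note are placeholders, not real nodes.

```python
from typing import Optional


def inventory_line(name: str, ansible_host: str,
                   ansible_user: str = "btnelson",
                   tailscale_ip: Optional[str] = None) -> str:
    """Build one INI inventory host line with the variables the runbook lists."""
    parts = [name, f"ansible_host={ansible_host}", f"ansible_user={ansible_user}"]
    # tailscale_ip is only added when the node is on the tailnet.
    if tailscale_ip:
        parts.append(f"tailscale_ip={tailscale_ip}")
    return " ".join(parts)
```

Calling `inventory_line("nelson-example", "192.0.2.10")` yields `nelson-example ansible_host=192.0.2.10 ansible_user=btnelson`, which would go under the appropriate group header in `hosts.ini`.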
Service Decommissioned
Derived from: Semaphore removed from monolith (Feb 18), DuckDNS/qdirstat pending removal
- Verify migration: Confirm the replacement service is running and accessible at its new location
- Stop container: `docker compose down` on the old host (or remove from `containers.yml` and redeploy)
- DNS: Remove or update the DNS rewrite in `common.yml`. Run `configure_adguard_dns.yml`.
- NPM: Remove or update proxy host in `common.yml`. Run `configure_npm_hosts.yml`.
- Homepage: Remove widget from `services.yaml`
- containers.yml: Remove from `active_containers` list
- Data cleanup: Archive any persistent data if not migrated. Remove docker volumes if no longer needed.
- Verify: Confirm the old endpoint no longer resolves / the new endpoint works
- CHANGELOG: Log with `**DECOMMISSIONED**` tag
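The DNS half of the Verify step can be sketched as follows. This is an assumption-laden example rather than an existing script: `resolve` and `check_decommission` are hypothetical names, and the check covers name resolution only, so the new endpoint's proxying still needs to be tested separately. The resolver is injectable so the logic can be exercised without touching the network.

```python
import socket
from typing import Callable, Dict, Optional


def resolve(hostname: str) -> Optional[str]:
    """Return the IPv4 address for hostname, or None if it does not resolve."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def check_decommission(old_host: str, new_host: str,
                       resolver: Callable[[str], Optional[str]] = resolve) -> Dict[str, bool]:
    """Report whether the old name is gone and the new name resolves."""
    return {
        "old_gone": resolver(old_host) is None,
        "new_up": resolver(new_host) is not None,
    }
```

Both values should be `True` once the DNS rewrite has been removed or updated and `configure_adguard_dns.yml` has run. Note that the result reflects whichever resolver the machine running the check uses.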
Node Decommissioned
Derived from: nelson-identity LXC decommission (Feb 19 — snapshot backup, then destroy)
- Pre-flight: Verify all services have been migrated off. Run a final audit (`audit_docker.yml`, `audit_proxmox_api.yml`) to confirm nothing is still running.
- Backup: Take a Proxmox snapshot or vzdump backup before destruction
- Inventory: Remove host from `ansible/inventory/hosts.ini`
- DNS: Remove node-specific DNS entries from `common.yml`. Run `configure_adguard_dns.yml`.
- NPM: Remove any proxy rules pointing to the old IP. Run `configure_npm_hosts.yml`.
- Tailscale: Remove node from tailnet (`tailscale admin` or via Tailscale web console)
- Proxmox: Destroy the VM/LXC
- PROTOCOL.md: Update Architecture section and Inventory Groups table
- ROADMAP.md: Update resource budget table
- KNOWLEDGE.md: Update any references to the old node name/IP
- Semaphore: Remove or update templates that targeted the old host group
- CHANGELOG: Log with `**DECOMMISSIONED**` tag
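Several steps above amount to finding leftover references to the old node name or IP in the inventory, `common.yml`, and the docs. A rough sketch of that search follows. It assumes the file contents are already loaded into a dict, and `find_references` is a hypothetical helper, not an existing playbook or script.

```python
import re
from typing import Dict, List


def find_references(texts: Dict[str, str], needles: List[str]) -> Dict[str, List[int]]:
    """Map each file path to the line numbers that still mention any needle."""
    # The lookarounds stop 192.0.2.1 from matching inside 192.0.2.10 and
    # nelson-a from matching inside nelson-ab or old-nelson-a.
    patterns = [
        re.compile(r"(?<![\w.-])" + re.escape(n) + r"(?![\w-])") for n in needles
    ]
    hits: Dict[str, List[int]] = {}
    for path, text in texts.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if any(p.search(line) for p in patterns):
                hits.setdefault(path, []).append(lineno)
    return hits
```

An empty result for the old node's name and IP suggests the inventory and documentation steps are complete. A non-empty result points at the exact lines still to update.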
DNS / Routing Changed
Derived from: NPM migration from monolith to nelson-edge, DNS rewrite updates (Feb 19)
- common.yml: Update `npm_proxy_hosts` and/or `dns_rewrites` entries with new IPs/ports
- Run playbooks: `configure_npm_hosts.yml` and `configure_adguard_dns.yml` via Semaphore
- Port forwarding: If external-facing, update UniFi port forwarding rules (currently via UniFi UI)
- SSL certificates: If domains changed, provision new certs (via NPM or Caddy)
- Homepage: Update dashboard URLs if service locations changed
- Verify: Test each changed domain resolves and proxies to the correct backend
- Audit: Run `audit_npm.yml` and `audit_network.yml` to confirm clean state
- CHANGELOG: Log changes with service-specific details
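To know which domains the Verify step has to cover, it can help to diff the routing entries before and after the edit. The sketch below assumes the entries have been reduced to a plain domain-to-target mapping. The real schema of `dns_rewrites` and `npm_proxy_hosts` is not shown in this file, so that shape is an assumption, and `changed_domains` is a hypothetical helper.

```python
from typing import Dict, Optional, Tuple


def changed_domains(before: Dict[str, str],
                    after: Dict[str, str]) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {domain: (old_target, new_target)} for added, removed, or changed entries."""
    changes = {}
    for domain in sorted(set(before) | set(after)):
        old, new = before.get(domain), after.get(domain)
        if old != new:
            changes[domain] = (old, new)
    return changes
```

A `None` on the left marks a newly added domain and a `None` on the right marks a removed one. Every key in the result is a domain to test by hand, and the same list is a convenient source for the service-specific CHANGELOG details.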
Scheduled Runbooks
Time-triggered checks run via Semaphore cron schedules on nelson-manager (192.168.1.30:3010). Results should be pushed to the notification/monitoring system (not yet deployed — see SPRINT.md).
Weekly: System Health Review
- Run `audit_master.yml` via Semaphore and review generated reports
- Check `.ops/SPRINT.md` for stale tasks (no progress in 7+ days)
- Verify backups ran successfully (check `/mnt/nelson-backups` file timestamps)
- Review AdGuard query logs for anomalies
- Check Proxmox host resource usage (RAM/CPU/disk headroom)
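The backup timestamp check can be approximated by looking at the age of the newest file under the backup directory. This is a minimal sketch: `newest_backup_age_days` is a hypothetical helper, and which age counts as "too old" depends on the backup schedule, which is not stated here.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age_days(root: str, now: Optional[float] = None) -> Optional[float]:
    """Age in days of the most recently modified file under root, or None if there are no files."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(root).rglob("*") if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 86400
```

Running `newest_backup_age_days("/mnt/nelson-backups")` on the backup host and comparing the result with the expected backup interval gives a quick signal. A fresh timestamp does not prove the backup is restorable, so it complements rather than replaces the playbook output.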
Weekly: Documentation Sync
- Verify `containers.yml` matches actually running containers
- Verify `common.yml` DNS rewrites and NPM hosts match live state
- Check `.ops/STANDUP.md` is current (not stale for more than 2 days)
- Archive any standup entries older than 3 days
- Check `.ops/PROTOCOL.md` architecture section matches actual running nodes
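The first check is a set comparison between what `containers.yml` declares and what is actually running. A sketch, assuming both sides have already been reduced to lists of container names (for instance, the running side from `docker ps --format '{{.Names}}'`); `container_drift` is a hypothetical helper.

```python
from typing import Dict, Iterable, List


def container_drift(declared: Iterable[str], running: Iterable[str]) -> Dict[str, List[str]]:
    """Compare declared container names against running ones."""
    declared_set, running_set = set(declared), set(running)
    return {
        # Declared in containers.yml but not running on the host.
        "missing": sorted(declared_set - running_set),
        # Running on the host but absent from containers.yml.
        "undeclared": sorted(running_set - declared_set),
    }
```

Both lists being empty means the documentation and the host agree. Anything under `undeclared` is either a manual deployment to document or a leftover to decommission.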
Monthly: Security Review
- Check for pending OS security updates on all nodes
- Review Ansible Vault secrets for any that need rotation
- Audit SSH keys across nodes (no unauthorized keys)
- Review NPM SSL certificate expiry dates
- Check for Docker image updates on critical services (compare pinned vs latest)
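The SSH key audit can be sketched as a comparison between the keys found in a node's `authorized_keys` file and an allowlist of known key blobs. This is a simplified example: the helper names are hypothetical, and the parser does not handle quoted options that contain spaces.

```python
from typing import Iterable, Set

# Prefixes of common OpenSSH public key types (e.g. ssh-ed25519, ecdsa-sha2-nistp256).
KEY_TYPE_PREFIXES = ("ssh-", "ecdsa-", "sk-")


def key_blobs(authorized_keys_text: str) -> Set[str]:
    """Extract the base64 key blobs from authorized_keys content."""
    blobs = set()
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # The blob is the field right after the key type; options may precede the type.
        for i, field in enumerate(fields[:-1]):
            if field.startswith(KEY_TYPE_PREFIXES):
                blobs.add(fields[i + 1])
                break
    return blobs


def unauthorized_keys(authorized_keys_text: str, allowed_blobs: Iterable[str]) -> Set[str]:
    """Return key blobs present in the file but not in the allowlist."""
    return key_blobs(authorized_keys_text) - set(allowed_blobs)
```

An empty set from `unauthorized_keys` for every node is the passing condition. Comparing blobs rather than comments matters because the trailing comment on a key line is free text and easy to spoof.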
Per-Phase: Retrospective
- Run at the end of each migration phase (or monthly if no phase completes)
- Review git log, CHANGELOG, and STANDUP for the period
- Identify: what went well, what went wrong, actionable lessons
- Write up in `.ops/archive/retrospectives/YYYY-MM-DD-<title>.md`
- Extract durable learnings into KNOWLEDGE.md
- Update PROTOCOL.md if conventions need to change
- Archive completed sprint tasks to `.ops/archive/sprints/`
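The retrospective filename pattern can be generated rather than typed by hand. In the sketch below, the lowercase hyphenated slug is an assumed convention (the runbook only specifies `YYYY-MM-DD-<title>.md`), and `retrospective_path` is a hypothetical helper.

```python
import re
from datetime import date


def retrospective_path(title: str, day: date) -> str:
    """Build the retrospective path for a title and date."""
    # Assumed slug style: lowercase, runs of non-alphanumerics collapsed to "-".
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f".ops/archive/retrospectives/{day.isoformat()}-{slug}.md"
```

With an illustrative title and date, `retrospective_path("Phase 1: Identity Migration", date(2024, 1, 15))` gives `.ops/archive/retrospectives/2024-01-15-phase-1-identity-migration.md`.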