Nelson Home — Runbooks

Procedural checklists triggered by events or schedules. Agents: consult this file before performing actions that match an event runbook. Written from real incidents — see .ops/archive/retrospectives/ for context.


Event Runbooks

New Docker Service Added

Derived from: Vaultwarden, Uptime Kuma, Grafana deployments on nelson-manager (Feb 18-19)

  1. Compose file: Create docker-compose/<service>/docker-compose.yml with pinned version tag (never latest for stateful services)
  2. Credentials: All passwords/tokens go in ansible/group_vars/all/secrets.yml (Vault-encrypted). Reference via Jinja2 variables in compose. Never commit plaintext credentials.
  3. containers.yml: Add the service to ansible/group_vars/all/containers.yml if managed by deploy_stack.yml
  4. DNS: Add entry to dns_rewrites list in ansible/group_vars/all/common.yml. Run configure_adguard_dns.yml via Semaphore.
  5. NPM: Add entry to npm_proxy_hosts list in common.yml. Run configure_npm_hosts.yml via Semaphore.
  6. Homepage: Add widget/service entry to docker-compose/homepage/config/services.yaml
  7. Monitoring: Add to Uptime Kuma (when deployed)
  8. CHANGELOG: Log with **FEAT** tag
  9. Verify: Confirm <service>.nelson.home resolves and proxies correctly
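Step 1's pinned-tag rule can be checked mechanically before a compose file is committed. The following is a minimal sketch, not part of the existing tooling: the function name `unpinned_images` and the sample image tags are hypothetical, and it only scans plain `image:` lines rather than parsing YAML, so Jinja2-templated tags would need separate handling.

```python
import re

# Matches an "image:" line in a compose file and captures the image reference.
IMAGE_LINE = re.compile(r"^\s*image:\s*['\"]?([^'\"\s#]+)")


def unpinned_images(compose_text: str) -> list[str]:
    """Return image references that have no tag or use the 'latest' tag."""
    flagged = []
    for line in compose_text.splitlines():
        match = IMAGE_LINE.match(line)
        if not match:
            continue
        image = match.group(1)
        # Digest-pinned images (name@sha256:...) are already fully pinned.
        if "@" in image:
            continue
        # Only a ':' after the last '/' is a tag separator; an earlier ':'
        # belongs to a registry port such as registry.local:5000.
        last_segment = image.rsplit("/", 1)[-1]
        _name, sep, tag = last_segment.partition(":")
        if not sep or tag == "latest":
            flagged.append(image)
    return flagged
```

An empty result means every image line carries an explicit tag other than latest. Whether a flagged image matters still depends on whether the service is stateful, which the script cannot tell.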

New Node Provisioned

Derived from: nelson-identity provisioning (Feb 18), nelson-manager provisioning (Feb 19)

  1. Proxmox: Create VM/LXC via Proxmox UI or API. For Docker workloads in an LXC: create the container as privileged, add lxc.apparmor.profile: unconfined to its config, and enable the keyctl=1 feature.
  2. Inventory: Add host to ansible/inventory/hosts.ini with ansible_host, ansible_user=btnelson, and tailscale_ip if applicable. Create inventory group if needed.
  3. SSH keys: Copy operator SSH key (ssh-copy-id btnelson@<new-ip>)
  4. Tailscale: Run deploy_tailscale.yml via Semaphore targeting the new node
  5. Knowledge link: Run knowledge_link.yml to clone the repo onto the new node
  6. Semaphore: Create/update templates that target the new host group
  7. DNS: Add nelson-<role>.nelson.home to DNS rewrites in common.yml. Run configure_adguard_dns.yml.
  8. Backups: Add backup jobs if the node holds state (see backup_vaultwarden.yml / backup_unifi.yml patterns)
  9. PROTOCOL.md: Update Architecture section and Inventory Groups table
  10. ROADMAP.md: Update resource budget table
  11. CHANGELOG: Log with **SUCCESS** tag
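Step 2 lists the variables each inventory entry should carry. A small pre-commit check along these lines could catch a missing one; this is a sketch under assumptions: `missing_host_vars` is a hypothetical helper, it treats `ansible_host` and `ansible_user` as required and `tailscale_ip` as optional, and it looks at a single INI-style host line rather than the whole file.

```python
import shlex

# Variables the runbook expects on every host line (tailscale_ip is optional).
REQUIRED_VARS = ("ansible_host", "ansible_user")


def missing_host_vars(inventory_line: str) -> list[str]:
    """Return required variables absent from one INI-style inventory host line."""
    # comments=True drops anything after a '#' on the line.
    tokens = shlex.split(inventory_line, comments=True)
    # The first token is the host name; the rest are key=value pairs.
    host_vars = dict(t.split("=", 1) for t in tokens[1:] if "=" in t)
    return [name for name in REQUIRED_VARS if name not in host_vars]
```

For example, `missing_host_vars("nelson-example ansible_user=btnelson")` reports that `ansible_host` is missing (the host name here is a placeholder).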

Service Decommissioned

Derived from: Semaphore removed from monolith (Feb 18), DuckDNS/qdirstat pending removal

  1. Verify migration: Confirm the replacement service is running and accessible at its new location
  2. Stop container: docker compose down on the old host (or remove from containers.yml and redeploy)
  3. DNS: Remove or update the DNS rewrite in common.yml. Run configure_adguard_dns.yml.
  4. NPM: Remove or update proxy host in common.yml. Run configure_npm_hosts.yml.
  5. Homepage: Remove widget from services.yaml
  6. containers.yml: Remove from active_containers list
  7. Data cleanup: Archive any persistent data if not migrated. Remove docker volumes if no longer needed.
  8. Verify: Confirm the old endpoint no longer resolves and, if the service was migrated, that the new endpoint works
  9. CHANGELOG: Log with **DECOMMISSIONED** tag

Node Decommissioned

Derived from: nelson-identity LXC decommission (Feb 19 — snapshot backup, then destroy)

  1. Pre-flight: Verify all services have been migrated off. Run a final audit (audit_docker.yml, audit_proxmox_api.yml) to confirm nothing is still running.
  2. Backup: Take a Proxmox snapshot or vzdump backup before destruction
  3. Inventory: Remove host from ansible/inventory/hosts.ini
  4. DNS: Remove node-specific DNS entries from common.yml. Run configure_adguard_dns.yml.
  5. NPM: Remove any proxy rules pointing to the old IP. Run configure_npm_hosts.yml.
  6. Tailscale: Remove the node from the tailnet via the Tailscale admin console
  7. Proxmox: Destroy the VM/LXC
  8. PROTOCOL.md: Update Architecture section and Inventory Groups table
  9. ROADMAP.md: Update resource budget table
  10. KNOWLEDGE.md: Update any references to the old node name/IP
  11. Semaphore: Remove or update templates that targeted the old host group
  12. CHANGELOG: Log with **DECOMMISSIONED** tag
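Steps 8 through 10 depend on finding every leftover mention of the old node. One way to sketch that search, assuming the repo is checked out locally: `find_stale_references` is a hypothetical helper, and the suffix list is a guess at which file types matter here.

```python
from pathlib import Path


def find_stale_references(
    root: Path,
    needles: list[str],
    suffixes: tuple[str, ...] = (".md", ".yml", ".yaml", ".ini"),
) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, needle) for each match under root."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for needle in needles:
                # Plain substring match: a short IP can also match a longer one.
                if needle in line:
                    hits.append((str(path.relative_to(root)), lineno, needle))
    return hits
```

Passing both the old node name and its IP as needles gives a list to work through; matches in the archive directories are usually historical and can stay.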

DNS / Routing Changed

Derived from: NPM migration from monolith to nelson-edge, DNS rewrite updates (Feb 19)

  1. common.yml: Update npm_proxy_hosts and/or dns_rewrites entries with new IPs/ports
  2. Run playbooks: configure_npm_hosts.yml and configure_adguard_dns.yml via Semaphore
  3. Port forwarding: If external-facing, update UniFi port forwarding rules (currently via UniFi UI)
  4. SSL certificates: If domains changed, provision new certs (via NPM or Caddy)
  5. Homepage: Update dashboard URLs if service locations changed
  6. Verify: Test each changed domain resolves and proxies to the correct backend
  7. Audit: Run audit_npm.yml and audit_network.yml to confirm clean state
  8. CHANGELOG: Log changes with service-specific details
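The DNS half of step 6 can be scripted by comparing what each domain resolves to against what common.yml says it should. This is a sketch only: `resolve_ipv4` and `dns_mismatches` are hypothetical names, the expected mapping is passed in by hand rather than read from common.yml, and it checks name resolution, not whether the proxy forwards to the right backend.

```python
import socket
from typing import Callable, Optional


def resolve_ipv4(hostname: str) -> Optional[str]:
    """Resolve a hostname to an IPv4 address, or None if it does not resolve."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def dns_mismatches(
    expected: dict[str, str],
    resolver: Callable[[str], Optional[str]] = resolve_ipv4,
) -> dict[str, Optional[str]]:
    """Return {domain: actual} for every domain whose answer differs from expected."""
    mismatches = {}
    for domain, want in expected.items():
        got = resolver(domain)
        if got != want:
            mismatches[domain] = got
    return mismatches
```

An empty dict means every listed domain resolves to its expected address. A value of None marks a domain that did not resolve at all. The check should run from a machine that uses AdGuard as its resolver, otherwise the rewrites are not visible.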

Scheduled Runbooks

Time-triggered checks run via Semaphore cron schedules on nelson-manager (192.168.1.30:3010). Results should be pushed to the notification/monitoring system (not yet deployed — see SPRINT.md).

Weekly: System Health Review

  • Run audit_master.yml via Semaphore and review generated reports
  • Check .ops/SPRINT.md for stale tasks (no progress in 7+ days)
  • Verify backups ran successfully (check /mnt/nelson-backups file timestamps)
  • Review AdGuard query logs for anomalies
  • Check Proxmox host resource usage (RAM/CPU/disk headroom)
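The backup timestamp check above can be reduced to one number: the age of the newest file under the backup directory. A minimal sketch, assuming backups land as regular files somewhere under /mnt/nelson-backups; `newest_backup_age_days` is a hypothetical helper, and the acceptable age is left to the caller because the backup schedule is not stated here.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age_days(backup_dir: Path, now: Optional[float] = None) -> Optional[float]:
    """Return the age in days of the most recently modified file, or None if empty."""
    mtimes = [p.stat().st_mtime for p in backup_dir.rglob("*") if p.is_file()]
    if not mtimes:
        return None
    now = time.time() if now is None else now
    return (now - max(mtimes)) / 86400
```

Calling `newest_backup_age_days(Path("/mnt/nelson-backups"))` and comparing the result to the expected backup interval flags a job that has silently stopped. A recent timestamp shows that a file was written, not that the backup is restorable.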

Weekly: Documentation Sync

  • Verify containers.yml matches actually running containers
  • Verify common.yml DNS rewrites and NPM hosts match live state
  • Check that .ops/STANDUP.md is current (last updated within the past 2 days)
  • Archive any standup entries older than 3 days
  • Check .ops/PROTOCOL.md architecture section matches actual running nodes
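The first check in this list is a two-way set comparison between what containers.yml declares and what is actually running. A sketch of that comparison, with `container_drift` as a hypothetical helper that takes both lists as input; how the lists are gathered is left open, though the running names could come from `docker ps --format '{{.Names}}'` on each host.

```python
from typing import Iterable


def container_drift(declared: Iterable[str], running: Iterable[str]) -> dict[str, list[str]]:
    """Compare declared container names against running ones, in both directions."""
    declared_set, running_set = set(declared), set(running)
    return {
        # Listed in containers.yml but not found running.
        "declared_not_running": sorted(declared_set - running_set),
        # Running on the host but absent from containers.yml.
        "running_not_declared": sorted(running_set - declared_set),
    }
```

Both lists empty means the file and the host agree. Entries under running_not_declared are either undocumented services or containers that are deliberately managed outside deploy_stack.yml.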

Monthly: Security Review

  • Check for pending OS security updates on all nodes
  • Review Ansible Vault secrets for any that need rotation
  • Audit SSH keys across nodes (no unauthorized keys)
  • Review NPM SSL certificate expiry dates
  • Check for Docker image updates on critical services (compare pinned vs latest)
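The certificate expiry review can be partly automated with Python's standard ssl module. The sketch below is an illustration, not the existing process: `fetch_not_after` and `days_until_expiry` are hypothetical names, and the fetch uses the default trust store, so it would raise a verification error for certificates signed by a private or self-signed CA.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate presented by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Convert a notAfter string such as 'Jan  5 09:34:43 2018 GMT' to days remaining."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400
```

Running `days_until_expiry(fetch_not_after(domain))` for each proxied domain gives a number to compare against a renewal threshold; a negative value means the certificate has already expired.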

Per-Phase: Retrospective

  • Run at the end of each migration phase (or monthly if no phase completes)
  • Review git log, CHANGELOG, and STANDUP for the period
  • Identify: what went well, what went wrong, actionable lessons
  • Write up in .ops/archive/retrospectives/YYYY-MM-DD-<title>.md
  • Extract durable learnings into KNOWLEDGE.md
  • Update PROTOCOL.md if conventions need to change
  • Archive completed sprint tasks to .ops/archive/sprints/