Nelson Home — Runbooks

Procedural checklists triggered by events or schedules. Agents: consult this file before performing actions that match an event runbook. Written from real incidents — see .ops/archive/retrospectives/ for context.

Event Runbooks

New Docker Service Added

Derived from: Vaultwarden, Uptime Kuma, Grafana deployments on nelson-manager (Feb 18-19)

Compose file: Create docker-compose/<service>/docker-compose.yml with a pinned image version tag (never latest for stateful services)
Credentials: All passwords/tokens go in ansible/group_vars/all/secrets.yml (Vault-encrypted). Reference via Jinja2 variables in compose. Never commit plaintext credentials.
containers.yml: Add the service to ansible/group_vars/all/containers.yml if managed by deploy_stack.yml
DNS: Add entry to dns_rewrites list in ansible/group_vars/all/common.yml. Run configure_adguard_dns.yml via Semaphore.
NPM: Add entry to npm_proxy_hosts list in common.yml. Run configure_npm_hosts.yml via Semaphore.
Homepage: Add widget/service entry to docker-compose/homepage/config/services.yaml
Monitoring: Add to Uptime Kuma (when deployed)
CHANGELOG: Log with **FEAT** tag
Verify: Confirm <service>.nelson.home resolves and proxies correctly
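
The pinned-tag rule in the Compose file step can be checked mechanically. The following is a minimal sketch, not part of the existing tooling; the function name is_pinned is hypothetical, and it assumes image references follow the usual registry/repo:tag or repo@digest form.

```python
def is_pinned(image: str) -> bool:
    """Return True if an image reference is pinned to a digest or a non-'latest' tag."""
    # A digest reference (repo@sha256:...) always points at one exact image.
    if "@" in image:
        return True
    # Only look for ':' in the final path segment, so a registry port
    # (for example registry.example:5000/app) is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return False  # no tag: Docker falls back to 'latest'
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")
```

A reference such as grafana/grafana:latest, or one with no tag at all, is reported as unpinned; any explicit version tag passes.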

New Node Provisioned

Derived from: nelson-identity provisioning (Feb 18), nelson-manager provisioning (Feb 19)

Proxmox: Create the VM/LXC via the Proxmox UI or API. For Docker workloads in LXC: create the container as privileged, add lxc.apparmor.profile: unconfined to the container config, and enable the keyctl=1 feature.
Inventory: Add host to ansible/inventory/hosts.ini with ansible_host, ansible_user=btnelson, and tailscale_ip if applicable. Create inventory group if needed.
SSH keys: Copy operator SSH key (ssh-copy-id btnelson@<new-ip>)
Tailscale: Run deploy_tailscale.yml via Semaphore targeting the new node
Knowledge link: Run knowledge_link.yml to clone the repo onto the new node
Semaphore: Create/update templates that target the new host group
DNS: Add nelson-<role>.nelson.home to DNS rewrites in common.yml. Run configure_adguard_dns.yml.
Backups: Add backup jobs if the node holds state (see backup_vaultwarden.yml / backup_unifi.yml patterns)
PROTOCOL.md: Update Architecture section and Inventory Groups table
ROADMAP.md: Update resource budget table
CHANGELOG: Log with **SUCCESS** tag
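
As an illustration of the Inventory step, here is a small sketch that builds a hosts.ini host line from the variables named above. The helper name inventory_line is hypothetical, and it assumes ansible_host is always given as an IP address rather than a hostname.

```python
import ipaddress
from typing import Optional


def inventory_line(
    name: str,
    ansible_host: str,
    tailscale_ip: Optional[str] = None,
    ansible_user: str = "btnelson",
) -> str:
    """Build one INI-style inventory line for a new node."""
    # Assumption: hosts are addressed by IP. ip_address() raises ValueError
    # on anything that is not a valid IPv4/IPv6 address.
    ipaddress.ip_address(ansible_host)
    parts = [name, f"ansible_host={ansible_host}", f"ansible_user={ansible_user}"]
    if tailscale_ip:
        ipaddress.ip_address(tailscale_ip)
        parts.append(f"tailscale_ip={tailscale_ip}")
    return " ".join(parts)
```

The returned line still has to be placed under the right group header in ansible/inventory/hosts.ini by hand; the sketch only guards against typos in the addresses.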

Service Decommissioned

Derived from: Semaphore removed from monolith (Feb 18), DuckDNS/qdirstat pending removal

Verify migration: Confirm the replacement service is running and accessible at its new location
Stop container: Run docker compose down on the old host (or remove the service from containers.yml and redeploy)
DNS: Remove or update the DNS rewrite in common.yml. Run configure_adguard_dns.yml.
NPM: Remove or update proxy host in common.yml. Run configure_npm_hosts.yml.
Homepage: Remove widget from services.yaml
containers.yml: Remove from active_containers list
Data cleanup: Archive any persistent data if not migrated. Remove docker volumes if no longer needed.
Verify: Confirm the old endpoint no longer resolves and, if the service was migrated, that the new endpoint works
CHANGELOG: Log with **DECOMMISSIONED** tag
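
The Verify step above can be sketched as a DNS check. This is an illustrative helper, not an existing script: check_decommission and its parameters are hypothetical names, and it assumes the machine running it uses AdGuard as its resolver.

```python
import socket
from typing import Callable, List, Optional


def resolve(hostname: str, resolver: Callable[[str], str] = socket.gethostbyname) -> Optional[str]:
    """Return the resolved IPv4 address, or None if the name does not resolve."""
    try:
        return resolver(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def check_decommission(
    old_host: str,
    new_host: Optional[str] = None,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    old_ip = resolve(old_host, resolver)
    if old_ip is not None:
        problems.append(f"{old_host} still resolves to {old_ip}")
    # new_host is optional: a service removed outright has no replacement.
    if new_host is not None and resolve(new_host, resolver) is None:
        problems.append(f"{new_host} does not resolve")
    return problems
```

The resolver argument exists so the logic can be exercised without touching live DNS. Note that local DNS caches may keep an old name resolving for a while after the rewrite is removed.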

Node Decommissioned

Derived from: nelson-identity LXC decommission (Feb 19 — snapshot backup, then destroy)

Pre-flight: Verify all services have been migrated off. Run a final audit (audit_docker.yml, audit_proxmox_api.yml) to confirm nothing is still running.
Backup: Take a Proxmox snapshot or vzdump backup before destruction
Inventory: Remove host from ansible/inventory/hosts.ini
DNS: Remove node-specific DNS entries from common.yml. Run configure_adguard_dns.yml.
NPM: Remove any proxy rules pointing to the old IP. Run configure_npm_hosts.yml.
Tailscale: Remove the node from the tailnet via the Tailscale admin console
Proxmox: Destroy the VM/LXC
PROTOCOL.md: Update Architecture section and Inventory Groups table
ROADMAP.md: Update resource budget table
KNOWLEDGE.md: Update any references to the old node name/IP
Semaphore: Remove or update templates that targeted the old host group
CHANGELOG: Log with **DECOMMISSIONED** tag
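
For the documentation steps, a leftover reference to the old node name or IP is easy to miss. A minimal sketch of a search over the repo follows; the function names and the list of file suffixes are assumptions for illustration, not an existing tool.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def scan_lines(lines: Iterable[str], needles: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for every line containing any needle."""
    needles = list(needles)
    return [
        (lineno, line.strip())
        for lineno, line in enumerate(lines, start=1)
        if any(needle in line for needle in needles)
    ]


def find_references(
    root: Path,
    needles: Iterable[str],
    suffixes: Tuple[str, ...] = (".md", ".yml", ".yaml", ".ini"),  # assumed set
) -> List[Tuple[Path, int, str]]:
    """Walk the repo and report every file line that mentions an old name or IP."""
    needles = list(needles)
    hits = []
    for path in sorted(root.rglob("*")):
        if ".git" in path.parts or not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        hits.extend((path, n, line) for n, line in scan_lines(text.splitlines(), needles))
    return hits
```

Matching is by plain substring, so searching for an IP that is a prefix of another (192.168.1.3 versus 192.168.1.30) produces false positives that need a manual look.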

DNS / Routing Changed

Derived from: NPM migration from monolith to nelson-edge, DNS rewrite updates (Feb 19)

common.yml: Update npm_proxy_hosts and/or dns_rewrites entries with new IPs/ports
Run playbooks: configure_npm_hosts.yml and configure_adguard_dns.yml via Semaphore
Port forwarding: If external-facing, update UniFi port forwarding rules (currently via UniFi UI)
SSL certificates: If domains changed, provision new certs (via NPM or Caddy)
Homepage: Update dashboard URLs if service locations changed
Verify: Test each changed domain resolves and proxies to the correct backend
Audit: Run audit_npm.yml and audit_network.yml to confirm clean state
CHANGELOG: Log changes with service-specific details
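
The proxy half of the Verify step can be approximated with an HTTP probe per changed domain. This is a sketch under assumptions: the helper names are hypothetical, and it treats any 5xx status or connection failure as a routing problem (nginx, which NPM is built on, answers 502 when it cannot reach the backend).

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the request failed outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, TLS error, timeout


def failing_domains(
    domains: Iterable[str],
    probe: Callable[[str], Optional[int]] = http_status,
    scheme: str = "https",
) -> Dict[str, Optional[int]]:
    """Map each domain that looks broken to the status observed (None = no response)."""
    failures = {}
    for domain in domains:
        status = probe(f"{scheme}://{domain}/")
        if status is None or status >= 500:
            failures[domain] = status
    return failures
```

A 401 or 403 counts as healthy here, since it shows the request reached a backend. Certificates that the local trust store does not accept surface as None, so pass scheme="http" or adjust the probe if internal certs are in use.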

Scheduled Runbooks

Time-triggered checks run via Semaphore cron schedules on nelson-manager (192.168.1.30:3010). Results should be pushed to the notification/monitoring system (not yet deployed — see SPRINT.md).

Weekly: System Health Review

Run audit_master.yml via Semaphore and review generated reports
Check .ops/SPRINT.md for stale tasks (no progress in 7+ days)
Verify backups ran successfully (check /mnt/nelson-backups file timestamps)
Review AdGuard query logs for anomalies
Check Proxmox host resource usage (RAM/CPU/disk headroom)
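
The backup timestamp check lends itself to a small script. The sketch below is illustrative only: the function names are hypothetical, and the 7-day default threshold is an assumption, since the backup schedule is not stated here.

```python
from pathlib import Path
from typing import Optional


def newest_mtime(directory: Path) -> Optional[float]:
    """Return the most recent modification time of any file under directory."""
    mtimes = [p.stat().st_mtime for p in directory.rglob("*") if p.is_file()]
    return max(mtimes) if mtimes else None


def backup_is_fresh(newest: Optional[float], now: float, max_age_days: float = 7.0) -> bool:
    """True if the newest backup file is no older than max_age_days.

    The 7-day default is an assumed threshold; set it to match the real schedule.
    """
    if newest is None:
        return False  # an empty backup directory is never considered healthy
    return (now - newest) <= max_age_days * 86400
```

In use, this would be called as backup_is_fresh(newest_mtime(Path("/mnt/nelson-backups")), time.time()) after importing time. Checking one subdirectory per backup job gives a more useful signal than checking the whole mount at once.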

Weekly: Documentation Sync

Verify containers.yml matches actually running containers
Verify common.yml DNS rewrites and NPM hosts match live state
Check that .ops/STANDUP.md is current (last updated within the past 2 days)
Archive any standup entries older than 3 days
Check .ops/PROTOCOL.md architecture section matches actual running nodes
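
The containers.yml check is a set comparison between declared and running containers. A minimal sketch follows, assuming the declared names have already been read from containers.yml (its schema is not shown here, so parsing is left to the caller) and that they match the container names Docker reports.

```python
import subprocess
from typing import Iterable, Set, Tuple


def running_container_names() -> Set[str]:
    """Return the names of running containers on the local Docker host."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}


def drift(declared: Iterable[str], running: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (declared but not running, running but not declared)."""
    declared_set, running_set = set(declared), set(running)
    return declared_set - running_set, running_set - declared_set
```

Both result sets should be empty when the file and the host agree. If Compose project prefixes make container names differ from the names in containers.yml, normalise one side before comparing.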

Monthly: Security Review

Check for pending OS security updates on all nodes
Review Ansible Vault secrets for any that need rotation
Audit SSH keys across nodes (no unauthorized keys)
Review NPM SSL certificate expiry dates
Check for Docker image updates on critical services (compare the pinned tag against the latest release)
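
Certificate expiry can also be read directly from the served certificate rather than the NPM UI. This is a sketch, not an existing check: the function names are hypothetical, and fetch_not_after assumes the certificate chains to a CA in the local trust store (a private CA would fail verification).

```python
import socket
import ssl


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate served by host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between now (epoch seconds) and a notAfter string such as
    'Jan  5 09:34:43 2018 GMT'. Negative values mean already expired."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400
```

Combining the two, days_until_expiry(fetch_not_after(domain), time.time()) gives the remaining lifetime for one domain; anything under a chosen renewal window can then be flagged in the review.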

Per-Phase: Retrospective

Run at the end of each migration phase (or monthly if no phase completes)
Review git log, CHANGELOG, and STANDUP for the period
Identify what went well, what went wrong, and actionable lessons
Write up in .ops/archive/retrospectives/YYYY-MM-DD-<title>.md
Extract durable learnings into KNOWLEDGE.md
Update PROTOCOL.md if conventions need to change
Archive completed sprint tasks to .ops/archive/sprints/
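
To keep retrospective filenames consistent with the YYYY-MM-DD-<title>.md pattern, the path can be generated rather than typed. A small sketch follows; the helper name and the lowercase-hyphenated slug format are assumptions, since the pattern above does not say how <title> should be written.

```python
import re
from datetime import date


def retrospective_path(title: str, when: date) -> str:
    """Build the archive path for a retrospective write-up."""
    # Assumed slug style: lowercase, runs of non-alphanumerics collapsed to '-'.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title must contain at least one letter or digit")
    return f".ops/archive/retrospectives/{when.isoformat()}-{slug}.md"
```

Passing date.today() as the second argument gives the path for a retrospective written today.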