
Nelson Home — Operational Knowledge Base

Durable patterns, API details, and gotchas. Organized by topic, not by date. Agents: add new sections when you discover patterns worth preserving.

Public Exposure & Domain Strategy

  • Owned Domains:
    • tudhopenelson.duckdns.org (Dynamic DNS)
    • tudhopenelson.com
    • palladiumresearch.com
    • tanzolabs.com
  • Active Public Endpoints:
    • proxmox.tudhopenelson.duckdns.org → nelson-pve:8006 (SSL: Let's Encrypt via NPM)
  • Strategy Checklist:
    • [ ] Determine which internal services (Nextcloud, Vaultwarden, etc.) should be exposed via *.tudhopenelson.com.
    • [ ] Configure DNS-01 challenge for wildcard SSL if moving to Caddy.
    • [ ] Implement Access Lists in NPM/Caddy for sensitive public endpoints. [Gemini, 2026-02-19 22:10]

Identity & SSH Hardening (Deterministic Identity)

To prevent "unreachable" errors and authentication drift, Nelson Home uses a Deterministic Identity Architecture:

  • Semaphore Identity: A dedicated SSH key pair (semaphore@nelson.home) is the sole identity for automation.
  • IaC Source of Truth: The public half of this key is stored in ansible/group_vars/all/common.yml as automation_public_key.
  • Enforcement: The ansible/harden_ssh_keys.yml playbook ensures this exact key is authorized on every node in the lab.
  • Bootstrap Pattern: When adding a new node or recovering from an auth failure, run .ops/bootstrap_identity.sh from your MacBook. This script uses standard SSH to "push" the Semaphore identity to all nodes without requiring Ansible on your workstation. [Gemini, 2026-03-07]
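The push itself is just plain SSH. A minimal Python sketch of what such a bootstrap push might look like — the node address, username, and key are placeholders, and the real logic lives in .ops/bootstrap_identity.sh:

```python
# Hypothetical sketch of a bootstrap "push": use plain SSH to append the
# Semaphore public key to a node's authorized_keys, idempotently.
# The key, user, and host below are placeholders.
import shlex

AUTOMATION_PUBLIC_KEY = "ssh-ed25519 AAAA... semaphore@nelson.home"  # placeholder

def bootstrap_command(user: str, host: str, pubkey: str) -> list[str]:
    """Build an ssh command that authorizes `pubkey` on `host` exactly once."""
    remote = (
        "mkdir -p ~/.ssh && chmod 700 ~/.ssh && "
        f"grep -qxF {shlex.quote(pubkey)} ~/.ssh/authorized_keys 2>/dev/null || "
        f"echo {shlex.quote(pubkey)} >> ~/.ssh/authorized_keys"
    )
    return ["ssh", f"{user}@{host}", remote]

cmd = bootstrap_command("btnelson", "192.168.1.30", AUTOMATION_PUBLIC_KEY)
# subprocess.run(cmd, check=True)  # would actually push the key
```

The grep -qxF guard is what makes repeated runs safe: the key is only appended if an exact-match line is not already present.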

Robust Audit Strategy

To ensure audits run across heterogeneous nodes (different OS versions, missing Python libraries, restricted API access), the following patterns are now preferred in Nelson Home:

  • Proxmox Infrastructure: Use native pvesh on the Proxmox host (hosts: proxmox_nodes) instead of external API calls. This avoids the need for external API tokens and handles authentication natively via SSH.
  • Docker Audits: Use a shell-based fallback (docker inspect) when the community.docker.docker_container_info module fails due to a missing requests library on the target node (common on minimal LXC/edge installs).
  • Multi-Node Aggregation: For audits covering multiple hosts, collect information into host-specific facts and then use a run_once task on localhost to aggregate them into a single markdown report. This prevents race conditions and ensures a complete view of the lab. [Gemini, 2026-03-07]
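A condensed Ansible sketch of the aggregation pattern (task names, fact names, and the report path are illustrative, not the actual playbook):

```yaml
# Illustrative sketch: collect per-host facts, then aggregate once on localhost.
- hosts: all
  tasks:
    - name: Collect per-host audit facts
      ansible.builtin.set_fact:
        audit_report:
          host: "{{ inventory_hostname }}"
          uptime: "{{ ansible_uptime_seconds | default(0) }}"

- hosts: localhost
  tasks:
    - name: Aggregate into one markdown report
      run_once: true
      ansible.builtin.copy:
        dest: /tmp/audit_report.md
        content: |
          # Lab Audit
          {% for h in groups['all'] %}
          - {{ hostvars[h].audit_report | default({'host': h, 'uptime': 'unreachable'}) }}
          {% endfor %}
```

The default() on each hostvars lookup keeps one unreachable node from failing the whole report.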

UniFi + MongoDB

  • The linuxserver/unifi-network-application:latest image now requires the unifi MongoDB user to have dbOwner on three databases: unifi, unifi_stat, and unifi_audit.
  • The community init-mongo.sh script only creates unifi and unifi_stat — it is missing unifi_audit.
  • Newer image versions (pulled 2026-02-19) added the unifi_audit requirement. Tomcat fails to start with error 13 (Unauthorized) if the permission is missing.
  • Live fix (no data loss): docker exec unifi-db mongosh --username root --password example --authenticationDatabase admin --eval 'db.getSiblingDB("admin").grantRolesToUser("unifi", [{ role: "dbOwner", db: "unifi_audit" }])'
  • init-mongo.sh has been patched to include unifi_audit for future fresh installs. [Claude Code, 2026-02-19]

UniFi Device Adoption (Force Re-Adopt)

When a UniFi device gets stuck adopting after the controller moves to a new IP:

  1. SSH into the device (e.g., USG at 192.168.1.1): ssh btnelson@192.168.1.1
    • Password: stored in Vaultwarden (do NOT put here — this file is in git)
  2. Enter the UniFi management console: mca-cli
  3. Point the device at the controller's inform URL: set-inform http://<controller-ip>:8080/inform
    • Current controller: 192.168.1.11:8080 (ubuntu-server; migration to nelson-manager planned)
  4. The device should appear as "Pending Adoption" in the controller UI within ~30 seconds
  5. Click Adopt in the controller UI

Applies to: USG, APs, switches — any UniFi device that loses track of the controller.

Docker-in-LXC: sysctl net.ipv4.ip_unprivileged_port_start Permission Denied

  • Symptom: Docker containers (NPM, AdGuard Home) in a Proxmox LXC fail to start with error: open sysctl net.ipv4.ip_unprivileged_port_start file: reopen fd 8: permission denied.
  • Root Cause: Newer Docker/runc versions (e.g., 28.x) attempt to set certain sysctls during container initialization. Even in a privileged LXC, the default AppArmor profile can block these calls.
  • Fix: Set lxc.apparmor.profile: unconfined in the LXC configuration file on the Proxmox host (/etc/pve/lxc/<vmid>.conf) and restart the LXC.
  • Prevention: Always use lxc.apparmor.profile: unconfined for LXCs hosting Docker workloads in Nelson Home, especially when using newer Docker versions. [Gemini, 2026-02-19 21:50]

AdGuard Home API

  • REST API runs on port 80 (same as dashboard). Port 3000 is the initial setup wizard ONLY.
  • Auth: HTTP Basic Auth (force_basic_auth: yes in Ansible).
  • List rewrites: GET /control/rewrite/list — returns [{domain, answer}].
  • Add rewrite: POST /control/rewrite/add with {domain, answer} body.
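A stdlib-only sketch of those two rewrite calls — credentials and the controller address are placeholders (read the real values from .env.local):

```python
# Minimal sketch of the AdGuard Home rewrite API described above.
# USER/PASSWORD and ADGUARD_URL are placeholders.
import base64
import json
import urllib.request

ADGUARD_URL = "http://192.168.1.2"      # port 80, same as the dashboard
USER, PASSWORD = "admin", "changeme"    # placeholders — use .env.local

def basic_auth_header(user: str, password: str) -> str:
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

def add_rewrite_request(domain: str, answer: str) -> urllib.request.Request:
    """Build (but do not send) the POST /control/rewrite/add request."""
    body = json.dumps({"domain": domain, "answer": answer}).encode()
    return urllib.request.Request(
        f"{ADGUARD_URL}/control/rewrite/add",
        data=body,
        method="POST",
        headers={
            "Authorization": basic_auth_header(USER, PASSWORD),
            "Content-Type": "application/json",
        },
    )

req = add_rewrite_request("ops.nelson.home", "192.168.1.2")
# urllib.request.urlopen(req)  # would actually create the rewrite
```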

Nginx Proxy Manager API

  • Auth: POST /api/tokens with {identity, secret} — returns {token}.
  • Use Authorization: Bearer <token> for all subsequent calls.
  • List hosts: GET /api/nginx/proxy-hosts — array of proxy host objects.
  • Create: POST (status 201). Update: PUT /api/nginx/proxy-hosts/<id> (status 200).
  • Idempotency pattern: Fetch all hosts, build dict keyed by domain_names[0]. For each desired host: not in map = create, in map + changed = update, otherwise skip.
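The idempotency pattern above reduces to a pure diff. A sketch (field names follow the NPM proxy-host schema loosely and should be treated as illustrative):

```python
# Classify desired NPM proxy hosts against what GET /api/nginx/proxy-hosts
# returned, keyed by domain_names[0], per the idempotency pattern above.

def plan_changes(existing: list[dict], desired: list[dict]) -> dict:
    """Return which desired hosts to create, update (PUT), or skip."""
    by_domain = {h["domain_names"][0]: h for h in existing}
    plan = {"create": [], "update": [], "skip": []}
    for want in desired:
        have = by_domain.get(want["domain_names"][0])
        if have is None:
            plan["create"].append(want)                       # POST, expect 201
        elif any(have.get(k) != v for k, v in want.items()):
            plan["update"].append(want)                       # PUT /api/nginx/proxy-hosts/<id>
        else:
            plan["skip"].append(want)
    return plan
```

Because the comparison only checks keys present in the desired spec, server-side fields like id or timestamps never cause spurious updates.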

Semaphore Template Configuration

When configuring a new template in Semaphore, ensure the following are set correctly to avoid execution failures:

  1. Variable Group: Every template must have a Variable Group assigned (e.g., "Default" or "Unifi Secrets") to provide required variables like unifi_password, npm_user, etc.
  2. Vaults: Even if a playbook does not explicitly use secrets, you must assign the "default" vault key if the repository contains any encrypted files (like ansible/group_vars/all/secrets.yml). Ansible decrypts all group variables at startup; without the vault key, the task will fail with "Attempting to decrypt but no vault secrets found".
  3. Specific Example (Audit Network):
    • Variable Group: Unifi Secrets (contains unifi_password for the ansible user).
    • Vaults: default (to allow Ansible to load the encrypted secrets.yml).

Service Credentials (.env.local)

  • All service credentials are stored in .env.local in the repo root (gitignored).
  • Agents MUST read .env.local to get API credentials before calling service APIs.
  • Variables: SEMAPHORE_USER, SEMAPHORE_PASSWORD, SEMAPHORE_URL, NPM_USER, NPM_PASSWORD, ADGUARD_USER, ADGUARD_PASSWORD, PROXMOX_PASSWORD, ANSIBLE_VAULT_PASSWORD.
  • .env.local is the Mac-local credential store. It is NOT committed to git.
  • If a credential doesn't work, it may have been rotated — check Vaultwarden. [Claude Code, 2026-03-08 09:00]

Semaphore API

  • Credentials: Read from .env.local (SEMAPHORE_USER, SEMAPHORE_PASSWORD, SEMAPHORE_URL).
  • Auth: POST /api/auth/login with cookie jar to get session.
  • List templates: GET /api/project/1/templates.
  • Run task: POST /api/project/1/tasks with {template_id}.
  • Poll: GET /api/project/1/tasks/<id> (check status == "success").
  • Output: GET /api/project/1/tasks/<id>/output.
  • Template update gotcha: PUT body MUST include "id": <template_id> matching the URL, or Semaphore returns 400.
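A small sketch of the payload shapes, including a guard for the PUT gotcha. The base URL is a placeholder, and the per-template PUT path (/templates/<id>) is assumed from the list endpoint above:

```python
# Payload helpers for the Semaphore API calls described above.
# SEMAPHORE_URL is a placeholder — read the real one from .env.local.
import json

SEMAPHORE_URL = "http://192.168.1.30:3000"   # placeholder
PROJECT = 1

def run_task_payload(template_id: int) -> bytes:
    """Body for POST /api/project/1/tasks."""
    return json.dumps({"template_id": template_id}).encode()

def template_put(template_id: int, body: dict) -> tuple[str, dict]:
    """Return (url, body) for a template update, enforcing the id gotcha:
    the body MUST carry the same "id" as the URL, or Semaphore returns 400."""
    body = dict(body)
    body["id"] = template_id
    url = f"{SEMAPHORE_URL}/api/project/{PROJECT}/templates/{template_id}"
    return url, body
```

Routing every template update through a helper like template_put makes the 400-on-missing-id failure mode impossible to hit.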

Semaphore Networking

  • Semaphore runs in Docker with its own network namespace.
  • 127.0.0.1 inside Semaphore = container loopback, NOT the host.
  • Always use the LAN IP (currently 192.168.1.30) to reach host services from playbooks running hosts: localhost in Semaphore.

Ansible + Semaphore Vault

  • Even if a playbook doesn't reference vault variables, Ansible still decrypts group_vars/all/secrets.yml when loading group vars.
  • Semaphore templates MUST have vault_key_id assigned or they fail with "Attempting to decrypt but no vault secrets found".

hostvars Key Names

  • hostvars dict is keyed by the inventory hostname, not alias.
  • hostvars['nelson-edge']['ansible_host'] works (LXC).
  • hostvars['nelson-pve']['ansible_host'] works (Proxmox).
  • The ubuntu-server host's inventory key is its IP 192.168.1.11, not the string ubuntu-server.

Homepage Dashboard

  • Homepage watches its config directory (mounted from repo).
  • A git pull on the server is sufficient to update — no container restart needed.
  • Use sync_repo.yml for config-only changes instead of deploy_stack.yml.

Proxmox Operations

  • Always use delegate_to: nelson-pve for tasks on Proxmox, never raw SSH shell commands.
  • Backup drive: 5TB WD Elements at /mnt/nelson-backups (UUID: 03575663-f377-4557-8d64-8bc9f161916e).
  • nelson-manager is a Privileged LXC (required for Docker sysctl + AppArmor). nelson-identity was decommissioned 2026-02-19.

Node Network Interfaces

nelson-edge (Proxmox LXC, 192.168.1.2)

  • Primary entrypoint for internal DNS and SSL routing.
  • Runs Docker workloads inside LXC (requires unconfined AppArmor profile).
  • Legacy Pi still on network: A physical Raspberry Pi (MAC dc:a6:32:a5:4d:fb, hostname "nelson-edge") appears at 192.168.1.247 (wired) and 192.168.1.90 (wireless) in network audits. This is the old edge node before migration to Proxmox LXC. It is not the active edge node. Do not confuse it with the LXC at 192.168.1.2.

SSH & Git

  • Always ensure origin is set to SSH (git@github.com) on controller nodes to prevent auth hangs.
  • Never overwrite existing SSH keys unless specifically requested; reuse existing keys.

Tailscale DNS & Split DNS

  • Tailnet: tadpole-dory.ts.net with MagicDNS enabled.
  • Split DNS: .nelson.home queries are routed to 100.77.163.93 (nelson-edge's Tailscale IP → AdGuard Home).
  • Global nameservers: Cloudflare (1.1.1.1) + Google (8.8.8.8) for everything else. "Override DNS servers" is ON.
  • Gotcha (2026-03-08): The old global nameserver 100.75.196.25 (nelson-pi) had stale DNS records pointing .nelson.home services to 192.168.1.11 (monolith) instead of 192.168.1.2 (nelson-edge/NPM). This broke all .nelson.home URLs on Tailscale-connected devices. Fix: replaced with split DNS rule pointing .nelson.home → nelson-edge's AdGuard.
  • Prevention: When changing DNS infrastructure, verify Tailscale nameserver config matches — Tailscale DNS overrides local network DNS when "Override DNS servers" is enabled. [Claude Code, 2026-03-08 00:15]

Observability Architecture

Nelson Home uses a two-tier alerting strategy:

Grafana (resource thresholds):

  • Datasource: Prometheus (uid: Prometheus, URL: http://prometheus:9090)
  • Custom dashboard: "Nelson Home Overview" (uid: nelson-home-overview) — set as home dashboard
  • Alert rules (folder: Nelson Home Alerts, group: infrastructure):
    • Node Down: up{job="node"} < 1 for 2m → critical
    • High CPU: > 90% for 5m → warning
    • High Memory: > 90% for 5m → warning
    • Disk Critical: > 85% for 5m → critical
  • Contact point: Telegram bot nelson-home (chat ID: 8150264504)
  • Notification policy: group by alertname, wait 30s, repeat 4h

Uptime Kuma (service availability):

  • 13 monitors: Semaphore, Vaultwarden, AdGuard, UniFi, Proxmox, Grafana, Prometheus, NPM, Ender 3, + 4 node pings
  • API key stored in Semaphore Default variable group as uptime_kuma_api_key
  • Telegram notification: same bot, applied to all monitors as default

Why both: Grafana depends on Prometheus — if Prometheus dies, Grafana can't alert. Kuma is standalone and will still catch service outages. They don't duplicate: Grafana watches performance, Kuma watches reachability.

Semaphore variables for monitoring:

  • grafana_admin_user / grafana_admin_password — Grafana login
  • uptime_kuma_api_key — Kuma automation key (expires 2027-03-08)
  • Telegram bot token stored in Grafana contact point config (not in Semaphore)

[Claude Code, 2026-03-08 09:40]

Grafana Provisioning: UID Behavior

  • If a datasource is first provisioned without a uid, Grafana auto-generates one (e.g., PBFA97CFB590B2093). Adding uid: Prometheus later does NOT update the existing datasource — Grafana matches by name and keeps the old UID.
  • Fix: Add a deleteDatasources block to the provisioning YAML to force deletion and re-creation with the correct UID on next startup.
  • Dashboard panels referencing "uid": "Prometheus" will show "No data" if the actual datasource UID differs. [Claude Code, 2026-03-08 09:10]

Docker Compose Bind Mounts & Redeploy

  • community.docker.docker_compose_v2 with state: present (the equivalent of docker compose up -d) does NOT recreate containers when only bind-mounted config files change. The running container keeps the old cached file.
  • Fix: Use recreate: always in community.docker.docker_compose_v2 or --force-recreate flag. Alternatively, use Prometheus lifecycle API (POST /-/reload) for Prometheus-specific config reloads — but this only works if the bind mount itself is refreshed. [Claude Code, 2026-03-08 09:05]
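A condensed task sketch of the fix (the project path is illustrative):

```yaml
# Force recreation so bind-mounted config changes actually land in the container.
- name: Redeploy monitoring stack
  community.docker.docker_compose_v2:
    project_src: /home/btnelson/nelson-server-config/docker-compose/monitoring  # illustrative path
    state: present
    recreate: always
```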

Uptime Kuma API

  • Kuma uses Socket.IO, not REST. Cannot automate via curl. Use the uptime-kuma-api Python library instead.
  • Must run from a host that can reach Kuma directly (localhost on nelson-manager, not from Mac unless port is exposed).
  • API key format: uk1_*, created via api.add_api_key(name=..., expires=..., active=True). Key is shown once — store immediately.
  • Monitor types: MonitorType.HTTP, MonitorType.PING, MonitorType.KEYWORD (for self-signed HTTPS with ignoreTls=True). [Claude Code, 2026-03-08 09:15]
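A sketch of driving Kuma through the uptime-kuma-api library. The spec builder is plain Python; the commented block shows roughly how the library calls fit together (hostnames and credentials are illustrative):

```python
# Build monitor specs for uptime-kuma-api's add_monitor(); pure and testable.
# Hostnames below are illustrative.

def http_monitor_spec(name: str, url: str, ignore_tls: bool = False) -> dict:
    """kwargs for api.add_monitor() — key names follow uptime-kuma-api."""
    spec = {"type": "http", "name": name, "url": url}
    if ignore_tls:
        spec["ignoreTls"] = True   # needed for self-signed HTTPS endpoints
    return spec

monitors = [
    http_monitor_spec("Grafana", "http://192.168.1.30:3000"),
    http_monitor_spec("Proxmox", "https://192.168.1.10:8006", ignore_tls=True),
]

# from uptime_kuma_api import UptimeKumaApi, MonitorType
# with UptimeKumaApi("http://localhost:3001") as api:  # must reach Kuma directly
#     api.login("admin", "...")
#     for m in monitors:
#         api.add_monitor(type=MonitorType.HTTP, name=m["name"], url=m["url"])
```

Keeping the specs as plain dicts means the same list can also seed Uptime Kuma after a rebuild.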

Moonraker Prometheus Support

  • Moonraker [prometheus] component requires a version newer than v0.10.0. On Raspbian Bullseye (32-bit), the Update Manager may not offer newer versions.
  • Symptom: "Unparsed config section [prometheus]" warning + "failed to load component" error in Mainsail UI.
  • Workaround: Monitor printer via Uptime Kuma HTTP check on /server/info endpoint instead. [Claude Code, 2026-03-08 09:20]

Nelson Ops: External CDN Dependencies

  • Rule: Never use CDN-hosted JS libraries (unpkg.com, cdnjs, etc.) in Nelson Ops. AdGuard or local DNS may block or fail to resolve external CDN domains.
  • Pattern: Download the library and serve from /public/ instead. Example: vis-network.min.js (689KB) is served locally at /vis-network.min.js. [Claude Code, 2026-03-08 00:15]

Nelson Ops — Dev Workflow

Nelson Ops (docker-compose/nelson-ops/) is a web application, not infrastructure config. It has its own dev loop — do NOT use Semaphore or deploy_stack for changes to this app.

Stack: Node.js/Express, server-rendered HTML, LCARS CSS, no build step.

Dev loop (edit → deploy in ~5 seconds):

# 1. Edit files locally in docker-compose/nelson-ops/app/
# 2. Run tests
cd docker-compose/nelson-ops/app && node --test lib/*.test.js
# 3. Commit + push
git add docker-compose/nelson-ops/ && git commit -m "..." && git push
# 4. Deploy — one command, nodemon auto-reloads
ssh btnelson@192.168.1.30 "git -C ~/nelson-server-config pull"

Why no Semaphore: deploy_stack.yml uses rsync with --delete-after and become: true, which causes permission issues and deletes app files. Nelson Ops is a simple app — git pull is sufficient because the container volume-mounts the repo directory and nodemon watches for changes.

Container details:

  • Runs on nelson-manager (192.168.1.30:3020), proxied at ops.nelson.home
  • Docker compose mounts ./app:/app:rw and /home/btnelson/nelson-server-config:/repo:rw
  • Command: npm install && npx -y nodemon -w /app server.js
  • After git pull, nodemon detects file changes and restarts automatically

Known issue: deploy_stack.yml rsync can delete docker-compose.yml and app files on nelson-manager due to permission conflicts. If the container won't start after a Semaphore deploy, run git checkout -- docker-compose/nelson-ops/ on nelson-manager to restore.

[Claude Code, 2026-03-08 22:15]

Nelson Ops: Adding New npm Dependencies

  • Symptom: After git pull with a new dependency in package.json, nodemon detects file changes and restarts the app — but the app crashes with Cannot find module '<new-package>' because npm install hasn't run yet.
  • Root Cause: The container command is npm install && npx -y nodemon. On initial start, npm install runs. But nodemon restarts only node server.js — it does NOT re-run npm install.
  • Fix: docker restart nelson-ops — this re-runs the full entrypoint (npm install && nodemon), installing the new dependency.
  • Prevention: After pushing a commit that adds a new npm dependency, always follow up with ssh btnelson@192.168.1.30 "docker restart nelson-ops". [Claude Code, 2026-03-08 23:45]

cAdvisor Docker API Version Compatibility

  • cAdvisor v0.49.1 uses Docker API client v1.41. If the Docker daemon requires minimum API v1.44+ (Docker 27+), cAdvisor fails to register the Docker container factory and falls back to systemd cgroup monitoring.
  • Symptom: Container metrics have id labels like /system.slice/docker-xxx.scope instead of name labels. Dashboard queries filtering name!="" return empty.
  • Fix: Upgrade to cAdvisor v0.51.0+ which supports newer Docker API versions.
  • Prevention: When deploying cAdvisor via shell (docker run), always add docker pull <image> before docker run to avoid using a cached old image. The deploy_node_exporter.yml playbook now does this. [Claude Code, 2026-03-08 10:00]

Home Assistant Reverse Proxy (Trusted Proxies)

  • HA rejects requests with non-matching Host headers unless http.trusted_proxies is configured.
  • Symptom: 400 Bad Request when accessing HA through NPM proxy (e.g., ha.nelson.home), but direct access (192.168.1.11:8123) works.
  • Fix: Add to configuration.yaml:
    http:
      use_x_forwarded_for: true
      trusted_proxies:
        - 192.168.1.2
    
  • HA config lives at /opt/docker-data/homeassistant/configuration.yaml on ubuntu-server (NOT /home/btnelson/homeassistant/). [Claude Code, 2026-03-08 11:15]

Unpoller (UniFi Monitoring)

  • Unpoller v2.34.0 deployed on nelson-manager as part of the monitoring compose stack.
  • Connects to UniFi Controller API at 192.168.1.30:8443 with read-only unpoller user.
  • Exports Prometheus metrics on :9130 with namespace unifipoller.
  • InfluxDB: Enabled by default even if not used — set UP_INFLUXDB_DISABLE=true to suppress connection refused errors.
  • Credentials: unpoller_password stored in Semaphore Default environment (not Ansible Vault). Templated into .env on deploy.
  • Dashboards: 6 community dashboards imported (IDs 11310-11315) + UniFi summary row on Nelson Home Overview. [Claude Code, 2026-03-08 10:30]

Semaphore + Ansible Vault: secrets.yml in .gitignore

  • ansible/group_vars/all/secrets.yml is in .gitignore. All secrets are managed via Semaphore environment variables (Default environment), NOT Ansible Vault files in the repo.
  • If secrets.yml is accidentally committed, Semaphore deploys fail with "Attempting to decrypt but no vault secrets found" because the template doesn't have a vault key linked.
  • Rule: Add new secrets to Semaphore Default environment (PUT /api/project/1/environment/4), not to secrets.yml. [Claude Code, 2026-03-08 10:15]

Naming Standard

  • Official name: Nelson Home
  • Internal domain: nelson.home
  • Infrastructure slug: nelson-home
  • Node convention: nelson-<role> (e.g., nelson-edge, nelson-manager, nelson-apps)
  • Tailnet: tadpole-dory.ts.net

Version Pinning (Standing Rule)

Never use latest tags for infrastructure services. Pin to a specific version in docker-compose files.

  • Why: On 2026-02-19, linuxserver/unifi-network-application:latest pulled a breaking change that required a new MongoDB permission (unifi_audit) and showed a setup wizard despite intact data. Recovery took 7 commits.
  • Rule: After deploying a new service or updating an image, immediately pin the working version tag in the compose file.
  • Exception: Dashboard-only services (Homepage, Glances) that don't hold state can use latest with caution.
  • Current pinned versions: UniFi 9.0.114, MongoDB 8.0, Postgres 15. [Claude Code, 2026-02-19 16:30]

Atomic Documentation Updates (Standing Rule)

When you change an IP, rename a node, move a service, or decommission anything, update all references in the same session. Files to check:

  1. ansible/inventory/hosts.ini — host entries and group names
  2. ansible/group_vars/all/common.yml — URLs, IPs, NPM hosts, DNS rewrites
  3. .ops/PROTOCOL.md — architecture section, inventory groups table, Semaphore URL
  4. .ops/ROADMAP.md — phase descriptions and resource budget
  5. .ops/RUNBOOKS.md — any URLs or IPs in scheduled runbooks
  6. docker-compose/*/docker-compose.yml — any hardcoded IPs
  7. GEMINI.md / dashboard URLs
  8. Homepage config files
  9. ~/.claude/projects/.../memory/MEMORY.md — safety rules block (Semaphore IP, critical nodes)

A stale IP in PROTOCOL.md is worse than no documentation — it gives agents false confidence and causes silent failures. [Claude Code, 2026-02-19 16:30 | updated 2026-02-20: added item 9 (MEMORY.md)]

Credential Hygiene

Known plaintext credentials in git history (as of 2026-02-19):

  • docker-compose/unifi/docker-compose.yml — MongoDB root example, unifi user unifi111
  • docker-compose/semaphore/docker-compose.yml — admin password admin
  • docker-compose/pulse/docker-compose.yml — password123
  • .env.local — all service credentials (gitignored but modified locally)

These cannot be removed from git history without a force-push/rewrite. The mitigation strategy is:

  1. Template credentials into Ansible Vault variables (SPRINT.md tracks this)
  2. Rotate all exposed passwords after templating
  3. Never add new plaintext credentials to committed files [Claude Code, 2026-02-19 16:30]