A GPU in Another Room: Onboarding a Remote Node Through a Tunnel
Securely connect to a machine, hand it a secret and a client, and have it phone home as a first-class node. No LAN exposure, no open ports — just the bugs in between.
The node you own but can't see
There's a specific kind of frustration in running infrastructure you own but can't see.
We have a DGX Spark — a Grace-Blackwell GB10, 128 GB of unified memory — sitting on the LAN, serving the reasoning model and embeddings all day. It works. Chat routes to it. The fleet uses it. By every functional measure it's part of the system.
And yet, to the platform, it barely existed. It was wired in the way these things always start: a couple of hand-rolled socat proxies forwarding ports to spark.local, and a pool entry in a YAML file. The dashboards didn't know its name. The mesh didn't know it was a node. When you opened the infrastructure view, the box doing a third of the thinking simply wasn't there.
Worse, the config had drifted. The YAML still listed three model backends that had never actually been deployed, pointed the fleet's Prometheus scrape targets at an IP the box hadn't used in months, and defaulted the whole pool to enabled: false. The map and the territory had quietly diverged, and nobody noticed because the one model that was running answered fine.
This is the gap between "a machine that works" and "a node the platform manages." We wanted to close it — properly. Not another socat tunnel. The DGX should be a first-class, observable mesh node: registered, health-checked, heartbeating its real hardware and containers, visible in the dashboard like any local service — and it should get there securely, with no LAN exposure on either side.
That last constraint is the whole story.
The shape of the answer
The mechanism already existed in spirit. A node runs a small cluster agent; the agent registers with the primary's gateway, gets a node id, and heartbeats every 30 seconds with its capabilities and running services. The gateway keeps a registry; the dashboards read it. Standard stuff.
The problem is the connection. For the agent on the DGX to reach the gateway, one of two things has to be true: either the gateway is exposed on the LAN (so the DGX can hit primary-ip:8777), or there's some other path. We checked the gateway's bind: 8777/tcp -> 127.0.0.1:8777. It listens on localhost only. Good — that's the safe default, and we weren't about to punch the control-plane gateway out onto the network just to register one node.
So we used the thing we already trust for everything else public: the Cloudflare tunnel.
The platform already runs a tunnel (cloudflared, in token mode, remotely managed) that fronts every public surface. Adding a node-registration path is just one more ingress route:
- hostname: cluster.aitherium.com
  service: https://aitheros-gateway:8777
  originRequest:
    originServerName: aitheros-gateway
    noTLSVerify: true
cloudflared lives on the same Docker network as the gateway, so it reaches aitheros-gateway:8777 directly — even though that port is bound to localhost on the host. The DGX, meanwhile, just talks to https://cluster.aitherium.com over the public internet. No LAN exposure, no inbound ports on the DGX, no inbound ports on the primary. The tunnel is the only door, and it only opens outward.
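For a remotely managed tunnel, adding that route means editing the tunnel's ingress list and sending it back through the Cloudflare API. Here is a minimal sketch of that step, under a few assumptions: that the tunnel configuration endpoint is a PUT to /accounts/{account_id}/cfd_tunnel/{tunnel_id}/configurations, that the ingress list must end with a catch-all rule, and that the names add_route and build_request are ours for illustration, not part of any SDK.

```python
import json
import urllib.request

API = "https://api.cloudflare.com/client/v4"  # assumed API base URL
CATCH_ALL = {"service": "http_status:404"}    # catch-all rule, kept last


def add_route(ingress, hostname, service, origin_request=None):
    """Return a new ingress list with the route added or replaced.

    Rules without a hostname are treated as the catch-all and dropped,
    then a single catch-all is appended at the end.
    """
    rule = {"hostname": hostname, "service": service}
    if origin_request:
        rule["originRequest"] = dict(origin_request)
    kept = [r for r in ingress if "hostname" in r and r["hostname"] != hostname]
    return kept + [rule, dict(CATCH_ALL)]


def build_request(account_id, tunnel_id, api_token, ingress):
    """Build (but do not send) the PUT that replaces the tunnel config."""
    body = json.dumps({"config": {"ingress": ingress}}).encode("utf-8")
    return urllib.request.Request(
        f"{API}/accounts/{account_id}/cfd_tunnel/{tunnel_id}/configurations",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )
```

Calling add_route twice with the same hostname leaves a single rule for it, so re-syncing is safe to repeat. Sending the request is one urllib.request.urlopen(req) call, left out here so the sketch has no side effects.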
Push the route config to the Cloudflare API, watch it create the DNS record, and the path is live. The plan was clean. Then we tried to use it.
Bug 1: the 502 that wasn't Cloudflare's fault
First visit to cluster.aitherium.com: 502 Bad Gateway.
A 502 from a tunnel means cloudflared itself is up but could not get a valid response from the origin. The network was fine (same Docker network, confirmed), so the problem was in how we were talking to the origin, and the clue was in our own route. We'd written service: http://aitheros-gateway:8777. But the gateway logs say:
Uvicorn running on https://0.0.0.0:8777
The gateway speaks HTTPS internally: inter-service TLS, like most of the stack. cloudflared was knocking with plain HTTP on a TLS port, getting a handshake error, and surfacing it as a 502. The fix is the three settings in the route shown earlier: https:// in the service URL, plus noTLSVerify: true for the self-signed internal cert, plus originServerName so SNI matches. Re-sync, and cluster.aitherium.com/health returns 200 {"status":"healthy","service":"AitherGateway"}.
One protocol character. Every one of these costs more than it should.
Bug 2: Cloudflare 403s your robot for being a robot
With the path working, we shipped the agent to the DGX and watched its log:
[dgx-agent] registered as... HTTP Error 403: Forbidden
403, not 401 — so it wasn't our auth rejecting it. Something upstream was. We tested from the DGX with curl, varying only one thing:
curl default UA -> 401 (reached the gateway, no token yet)
UA "Python-urllib/3.11" -> 403 (blocked before the gateway)
UA "Mozilla/5.0" -> 401 (reached the gateway)
There it is. Cloudflare's bot protection 403s the default python-urllib User-Agent. Our stdlib agent used urllib, advertised itself honestly as a Python script, and got treated like one. The fix is a one-line header — give the agent a real identity:
"User-Agent": "Mozilla/5.0 (compatible; AitherClusterAgent/1.0; +https://aitherium.com)"
Not a disguise — a stable, attributable UA that doesn't trip a generic bot rule. Restart, and the agent registers. This is the kind of bug you cannot reason your way to from the code; you have to compare two requests that differ by one header and let the responses tell you.
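For reference, this is roughly what the header fix looks like with nothing but urllib. It is a sketch, not our agent verbatim: the Bearer-style Authorization header is an assumption about how the token is presented, and build_post is a name made up for the example.

```python
import json
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (compatible; AitherClusterAgent/1.0; +https://aitherium.com)"
)


def build_post(url, payload, token):
    """Build a JSON POST that carries an explicit User-Agent.

    Without the header, urllib's default opener fills in its own
    'Python-urllib/<version>' value when the request is sent.
    """
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "User-Agent": USER_AGENT,
            "Content-Type": "application/json",
            # Assumption: the PSK travels as a bearer token.
            "Authorization": f"Bearer {token}",
        },
    )
```

Building the request separately from sending it also makes the one-header comparison easy to repeat: construct two requests that differ only in User-Agent and compare the status codes.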
Bug 3: the agent that wouldn't boot on the box it had to run on
We'd intended to run the real cluster agent — the full one. But the DGX is aarch64, and the only AitherOS image on it is the training image. When we tried to import the agent there:
ModuleNotFoundError: No module named 'lib.cluster'
The full agent imports lib.core.AitherHttp — a thin internal HTTP wrapper — which drags in the whole library. Shipping the entire platform lib to a remote node just to register it is the wrong amount of dependency. The agent is talking to a public Cloudflare endpoint with a valid cert; it doesn't need any of our internal TLS plumbing.
So we wrote a stdlib-only agent — urllib, subprocess, json, nothing else. It does exactly what the registry needs: gather GPU (via nvidia-smi), CPU, memory, and the container list (via docker ps); register; then heartbeat every 30 seconds. ~120 lines, runs on any Python 3, no install step. The right client for a remote node is the smallest one that does the job.
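The collection half of such an agent might look like the sketch below. The payload field names (name, hardware, services) and the function names are illustrative assumptions, and the send callable stands in for whatever actually posts to the gateway. The one design point it tries to show is that a missing or failing tool produces an empty value and never crashes the agent.

```python
import os
import subprocess
import time


def run(cmd, timeout=10):
    """Run a command and return its stdout, or '' if it is missing or fails."""
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout, check=True
        )
        return done.stdout
    except (OSError, subprocess.SubprocessError):
        return ""


def list_containers(runner=run):
    """Names of running containers, via `docker ps`."""
    out = runner(["docker", "ps", "--format", "{{.Names}}"])
    return [line.strip() for line in out.splitlines() if line.strip()]


def gpu_name(runner=run):
    """First GPU name reported by nvidia-smi, or None."""
    out = runner(["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"])
    lines = out.strip().splitlines()
    return lines[0].strip() if lines else None


def build_payload(node_name, runner=run):
    """Assemble one registration/heartbeat body (field names are assumed)."""
    return {
        "name": node_name,
        "hardware": {"gpu": gpu_name(runner), "cores": os.cpu_count()},
        "services": list_containers(runner),
    }


def heartbeat_forever(send, node_name, interval=30, sleep=time.sleep):
    """Send a fresh payload every `interval` seconds; survive network errors."""
    while True:
        try:
            send(build_payload(node_name))
        except OSError:  # urllib.error.URLError is a subclass of OSError
            pass
        sleep(interval)
```

Passing the runner in as a parameter keeps the collectors testable on a machine that has neither Docker nor an NVIDIA GPU.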
Bug 4: a GPU agent in a container that can't see the GPU
We ran the agent as a container with --gpus all. It registered — with gpu: None, cores: None, services: 0. The containerized agent couldn't see nvidia-smi or the host's Docker socket in any useful way on the GB10.
The honest fix was to stop fighting it. The DGX host has nvidia-smi and docker natively. So we run the agent on the host, as a systemd service:
[Service]
EnvironmentFile=/opt/aitheros/cluster.env
ExecStart=/usr/bin/python3 /opt/aitheros/dgx_node_agent.py
Restart=always
RestartSec=5
Run systemctl enable --now on the unit, and it's reboot-safe, auto-restarting, and has full visibility into the machine it's reporting on. Sometimes the cleanest deployment isn't a container. It's the unit file that's been doing this job reliably for fifteen years.
Bug 5: a GPU that refuses to report its own VRAM
The host agent registered — and still showed vram: 0. But this time it wasn't us. Ask the GB10 directly:
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits
> NVIDIA GB10, [N/A]
The GB10 is unified memory — CPU and GPU share one 128 GB pool — so memory.total comes back [N/A]. There's no separate VRAM figure to report. The agent was parsing it correctly; the value genuinely doesn't exist in the form the query expects. The right answer is to surface the unified memory (which /proc/meminfo reports as ~121.7 GB) and label the GPU honestly as NVIDIA GB10. Accurate beats tidy: the node shows a GB10 with 121.7 GB of memory, which is exactly what it is.
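A small sketch of that fallback, assuming the two inputs are the CSV line from the nvidia-smi query shown earlier and the text of /proc/meminfo. The function names and the memory_source label are invented for the example. With nounits, nvidia-smi reports memory.total in MiB, and /proc/meminfo reports MemTotal in kB.

```python
def parse_gpu_line(line):
    """Parse one 'name, memory.total' CSV line; '[N/A]' becomes None."""
    name, _, mem = line.strip().partition(",")
    mem = mem.strip()
    vram_mib = int(mem) if mem.isdigit() else None
    return name.strip(), vram_mib


def mem_total_gib(meminfo_text):
    """MemTotal from /proc/meminfo text, converted from kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) / (1024 * 1024)
    return None


def describe_gpu(smi_line, meminfo_text):
    """Prefer the GPU's own figure; fall back to system memory when absent."""
    name, vram_mib = parse_gpu_line(smi_line)
    if vram_mib is not None:
        return {
            "gpu": name,
            "memory_gib": round(vram_mib / 1024, 1),
            "memory_source": "gpu",
        }
    total = mem_total_gib(meminfo_text)
    return {
        "gpu": name,
        "memory_gib": None if total is None else round(total, 1),
        "memory_source": "system",
    }
```

Carrying the source of the number alongside the number keeps the dashboard honest: a reader can tell a dedicated VRAM figure from a shared pool without knowing the hardware.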
Bug 6: the registry that threw away the answer
Final indignity: even with the host agent sending full hardware, the registry showed gpu: None — and two DGX nodes instead of one.
Two bugs in the gateway's register handler, both ours:
- It read name, host, port, capabilities, labels from the registration body and silently dropped hardware. The GPU/CPU/memory we were carefully gathering went straight into the void.
- It minted a fresh node_id on every registration. Every agent restart created a new entry, so the registry accumulated ghosts.
Both fixed where they belonged. The register handler now persists hardware, dedups by node name (a re-registering node updates its existing entry instead of spawning a twin), and the heartbeat refreshes hardware and container lists live. Run it, and the registry collapses to one clean truth:
node-dgx-spark | online | NVIDIA GB10 | 20 cores | 121.7 GB | 7 containers
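The two fixes are easier to see in miniature. The class below is an in-memory sketch and not the gateway's real handler: NodeRegistry, its method names, and the 90-second offline threshold (three missed heartbeats) are all assumptions made for the example.

```python
import time
import uuid

# Fields copied from the registration body; "hardware" is the one that was lost.
REGISTRATION_FIELDS = ("host", "port", "capabilities", "labels", "hardware")


class NodeRegistry:
    def __init__(self, offline_after=90.0, clock=time.time):
        self._by_name = {}  # dedup key: the node's name
        self._offline_after = offline_after
        self._clock = clock

    def register(self, body):
        """Create the node once; later registrations update it in place."""
        name = body["name"]
        node = self._by_name.get(name)
        if node is None:
            node = {"node_id": f"node-{uuid.uuid4().hex[:8]}", "name": name}
            self._by_name[name] = node
        for field in REGISTRATION_FIELDS:
            if field in body:
                node[field] = body[field]
        node["last_seen"] = self._clock()
        return node["node_id"]

    def heartbeat(self, node_id, hardware=None, services=None):
        """Refresh liveness and, when supplied, hardware and services."""
        for node in self._by_name.values():
            if node["node_id"] == node_id:
                if hardware is not None:
                    node["hardware"] = hardware
                if services is not None:
                    node["services"] = services
                node["last_seen"] = self._clock()
                return True
        return False

    def status(self, name):
        age = self._clock() - self._by_name[name]["last_seen"]
        return "online" if age <= self._offline_after else "offline"

    def nodes(self):
        return list(self._by_name.values())
```

Keying on the name means an agent restart gets its old node_id back, which is what stops the ghosts from accumulating.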
The part you don't see: handing over the secret
Threaded through all of this was the question that has no clever shortcut: how do you give a remote machine a shared secret without leaking it?
The registration endpoint is PSK-gated — a node must present AITHER_CLUSTER_TOKEN or it gets a 401 (which, to be clear, is the gate working: random callers are rejected; the DGX, holding the token, is admitted). That secret has to exist in exactly two places: the gateway, and the node. It must touch neither a command line, nor a log, nor a file on the primary.
So we never let it. The token was generated inside a container and written straight to the vault in one process — it was never printed. The gateway reads it from the vault at request time, not from an env var. And to put it on the DGX, we read it from the vault inside a container and piped it through the SSH connection into a chmod 600 env file on the target — it crossed exactly one encrypted hop, from vault to node, and landed nowhere else. The agent loads it via EnvironmentFile.
When the platform's own safety classifier blocked an earlier, sloppier attempt — one that would have written the token to a temp file and interpolated it on a command line — that was the system catching us doing the wrong thing. Good. The right version is more work and worth it.
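In outline, the careful version has one property worth checking mechanically: the token appears in the SSH process's stdin and nowhere in its argument list. The sketch below assumes the caller has already read the token from the vault (that part is out of scope here), and push_secret is a name invented for the example. The remote path and variable name are the ones used in this post.

```python
import shlex
import subprocess


def push_secret(
    host,
    token,
    remote_path="/opt/aitheros/cluster.env",
    var="AITHER_CLUSTER_TOKEN",
    runner=subprocess.run,
):
    """Write VAR=token into a 0600 file on `host`, sending it over stdin."""
    if "\n" in token:
        raise ValueError("token must be a single line")
    quoted = shlex.quote(remote_path)
    # The remote command contains only the path, never the secret.
    remote_cmd = f"umask 077 && cat > {quoted} && chmod 600 {quoted}"
    runner(
        ["ssh", "-o", "BatchMode=yes", host, remote_cmd],
        input=f"{var}={token}\n".encode("utf-8"),
        check=True,
    )
```

The umask covers a freshly created file and the chmod covers one that already existed with looser permissions. Because the secret is never part of argv, it cannot show up in a process listing or in shell history on either machine.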
So, how do we do remote deployment?
Step back from the DGX and this is the general shape of secure remote onboarding, and it's reusable:
- Connect securely — SSH (key-based), the one door you already trust to a box.
- Hand over a secret without leaking it — generate in the vault, pipe through the SSH channel into a locked-down file on the target. Never on a command line, never on the primary's disk.
- Ship the smallest possible client — a stdlib agent, not the whole platform. Copy it over the same channel.
- Run it as a service: systemd, Restart=always, reboot-safe.
- Phone home through the tunnel, not the LAN: the node registers outbound to a Cloudflare hostname, PSK-authenticated. No inbound ports, no exposed control plane, no flat-network trust.
Connect, hand over a key, push a client, run it, have it report back over a tunnel. That's the pattern — so we paved it. There's now a one-command installer:
AITHER_NODE_TOKEN=<psk> sh -c "$(curl -fsSL https://cluster.aitherium.com/install.sh)"
The script (served by the gateway itself, public — the security is the PSK, exactly like the install scripts you already trust) detects the node's arch and GPU, fetches the stdlib agent, writes the locked-down env, installs the reboot-safe systemd unit, and registers — all of steps 2–5 above, idempotently. We proved it by re-onboarding the DGX through the installer and watching it land as the same single clean node. The DGX was the first one we drove by hand; every node after it is one line.
What's live now
Open the infrastructure dashboard and the DGX is there — not as a diagram, as a node: NVIDIA GB10, 121.7 GB, 20 cores, 7 containers, online, heartbeating every 30 seconds, alongside the local RTX 5090 in a GPU Fleet panel that finally shows the real compute instead of empty cloud slots. It survived a night of heartbeats and a service restart without a hiccup.
The box that was doing a third of the thinking, and that the platform couldn't see, is now a first-class member of it — reached over a tunnel it dials itself, trusted by a secret that never leaked, and honest about being a GB10 that keeps its memory in one big pool.
The work was, as always, 20% architecture and 80% the six bugs between "it should work" and "it works." That ratio never changes. It's the job.