Tags: infrastructure, deployment, mesh, gpu, security, cloudflare, deep-dive

One stdlib File Turns a Rented H100 Into a Sovereign Mesh Node

Name: AitherOS
Author: Aitherium

June 4, 2026 · 7 min read · Aitherium

Here is the moment that mattered:

TOTAL NODES: 2
  node-dgx-spark          NVIDIA GB10        online
  vast-h100-39467543      NVIDIA H100 80GB   online   ← newly onboarded
      hardware: H100 80GB HBM3 · 39233/81559 MB · util 99% · provider=vast_ai · instance_id=39467543
      labels:   tier=sovereign, instance_id=39467543, tenant=platform

The first node is a DGX Spark sitting on a desk. The second is an H100 we rented from vast.ai minutes earlier — a machine on the other side of the public internet that we will give back when the job is done. Both appear in the same node registry, both report online, and the H100 is streaming live GPU telemetry: 39GB of 80GB in use, 99% utilization. That utilization number is a fine-tune that was already running on the box — onboarding it into our mesh didn't touch the workload at all.

No VPN. No Kubernetes join token. No cloning the AitherOS stack onto the rented box. We pushed one Python file and it walked into the mesh on its own.

This is the PSK remote-onboard path, and it now works end-to-end. Here's how.

The path that died

We used to onboard cloud GPU boxes with deploy_aither_client() in the vast.ai provider. It SSHed in, apt-get installed a pile of dependencies, and then did this:

# Download AitherClient
curl -sL https://raw.githubusercontent.com/Aitherium/AitherOS/main/AitherOS/AitherNode/lib/AitherClient.py -o ~/.aither/aitherclient.py

That curl pulls the client from a public GitHub raw URL. The instant the repo went private, that URL started returning 404, and the whole onboard path silently broke: curl without -f still exits 0 on an HTTP error, so the box would provision, the script would "succeed," and the node would simply never show up in the registry. Pulling code from a public mirror was always a liability anyway, because it couples a remote box's ability to join the fleet to the visibility of a source repo.

So we replaced the pull with a push, and we replaced the dependency-heavy client with something that needs nothing.

One file, no pip, no clone

The new client is aither_node_agent.py — about 150 lines, standard library only. No httpx, no pyyaml, no cryptography, no virtualenv. A rented box has Python 3 and curl; that's the entire requirement. The provider SCPs this one file over and runs it. Its whole job is two things: register, then heartbeat.

The heartbeat is a plain loop:

def main():
    st, resp = register()
    node_id = resp.get("node_id")
    if not node_id:
        sys.exit(1)
    ...
    while True:
        st, _ = _post(
            f"/nodes/{node_id}/heartbeat",
            {"node_id": node_id, "capabilities": _capabilities(gpu_info())},
        )
        time.sleep(HB_INTERVAL)   # default 30s

Every 30 seconds it shells out to nvidia-smi, packs the result, and posts it. There's no SDK and no schema library — gpu_info() parses CSV out of nvidia-smi and builds a dict by hand:

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total,memory.used,utilization.gpu",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, timeout=10,
).stdout.strip()

That's why the registry shows 39233/81559 MB · util 99% — it's the live nvidia-smi reading from the rented box, refreshed twice a minute.
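A minimal sketch of the parsing step, assuming the four-column CSV layout produced by the query above. The function name parse_gpu_csv and the dict keys are illustrative, not the agent's actual names.

```python
import csv
import io

# Column order follows the --query-gpu list in the nvidia-smi call above.
EXPECTED_COLUMNS = 4


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != EXPECTED_COLUMNS:
            continue  # skip blank or malformed lines rather than guessing
        name, total, used, util = (cell.strip() for cell in row)
        try:
            gpus.append({
                "name": name,
                "memory_total_mb": int(total),
                "memory_used_mb": int(used),
                "utilization_pct": int(util),
            })
        except ValueError:
            # Non-numeric placeholders (for example "[N/A]") are dropped here.
            continue
    return gpus


sample = "NVIDIA H100 80GB HBM3, 81559, 39233, 99\n"
print(parse_gpu_csv(sample))
```

Keeping the parser a pure function of the text makes it easy to test without a GPU on the machine.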

The handshake is a pre-shared key

The node registry lives on AitherGateway and is exposed to the world through a Cloudflare tunnel at cluster.aitherium.com. A public registration endpoint is exactly the kind of thing that keeps you up at night, so it's gated. The agent signs its node name with HMAC-SHA256, using the PSK as the key:

def _signature():
    """HMAC-SHA256(psk, node_name) when a PSK is present, else a derived token."""
    if PSK:
        return hmac.new(PSK.encode(), NODE_NAME.encode(), hashlib.sha256).hexdigest()
    ...
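For illustration, here is how a signature of this shape could be produced and checked. This is a sketch under assumptions: sign_node_name and verify_signature are hypothetical names, and the verification side is not taken from the gateway source.

```python
import hashlib
import hmac


def sign_node_name(psk: str, node_name: str) -> str:
    """HMAC-SHA256 keyed with the PSK, computed over the node name."""
    return hmac.new(psk.encode(), node_name.encode(), hashlib.sha256).hexdigest()


def verify_signature(psk: str, node_name: str, presented: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign_node_name(psk, node_name)
    # Compare bytes so non-ASCII input from a request cannot raise TypeError.
    return hmac.compare_digest(expected.encode(), presented.encode())


sig = sign_node_name("example-psk", "vast-h100-39467543")
print(len(sig), verify_signature("example-psk", "vast-h100-39467543", sig))
```

hmac.compare_digest avoids leaking, through timing, how many leading characters of a guess were correct, which a plain == comparison on secrets can do.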

On the gateway side, /nodes/register refuses anything that doesn't present a valid credential — and it accepts two kinds:

_cluster_psk = (aither.get_secret("AITHER_CLUSTER_TOKEN", "") or "").strip()
if _cluster_psk:
    presented = (body.get("token") or request.headers.get("X-Cluster-Token") or "").strip()
    if presented and presented == _cluster_psk:
        pass                                  # long-lived cluster PSK (back-compat)
    elif _enroll_token_valid(presented):
        _enroll_tokens[presented]["used"] = True   # single-use — consume on success
        _node_tenant = _enroll_tokens[presented].get("tenant", "")
    else:
        raise HTTPException(status_code=401, detail="Invalid or missing cluster/enrollment token")

The long-lived cluster PSK is the back-compat path. The one we prefer is a single-use enrollment token: mint it for a specific tenant, hand it to exactly one box, and it's consumed the moment that box registers. A leaked enrollment token is worthless after first use, and it carries its own tenant attribution — the H100 came up labeled tenant=platform because that's who the token was minted for. That's the difference between a shared house key and a one-time entry code.
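The single-use behavior can be sketched as a small in-memory store. The "used" and "tenant" fields mirror the excerpt above; mint_enroll_token, consume_enroll_token, and the expiry window are assumptions added for illustration, not the gateway's actual API.

```python
import secrets
import time

# token -> {"tenant": str, "used": bool, "expires_at": float}
_enroll_tokens = {}


def mint_enroll_token(tenant, ttl_seconds=900, now=None):
    """Create a random token bound to one tenant (the TTL is an assumption)."""
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(32)
    _enroll_tokens[token] = {
        "tenant": tenant,
        "used": False,
        "expires_at": now + ttl_seconds,
    }
    return token


def consume_enroll_token(presented, now=None):
    """Return the tenant and mark the token used, or None if it is not valid."""
    now = time.time() if now is None else now
    record = _enroll_tokens.get(presented)
    if record is None or record["used"] or now > record["expires_at"]:
        return None
    record["used"] = True  # consumed on first successful use
    return record["tenant"]


token = mint_enroll_token("platform")
print(consume_enroll_token(token))  # first use returns the tenant
print(consume_enroll_token(token))  # second use is rejected
```

A real gateway would also need this store to survive restarts and to be safe under concurrent requests; the sketch ignores both.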

The PSK never appears in this post, never appears in a log line, and never lands on disk in plaintext on the gateway — it lives in AitherSecrets and is fetched with get_secret().

The Cloudflare JA3 war story

Here's the bug that ate an afternoon. The first version of the agent used Python's urllib to POST to the gateway. It got a 403 every time. We added a browser User-Agent. Still 403.

Cloudflare wasn't reading the User-Agent. It was fingerprinting the TLS handshake — the JA3 signature of the client's cipher suites and extensions. Python's urllib has a TLS fingerprint that doesn't look like a browser, and no header spoofing changes that, because the fingerprint is computed below HTTP, during the handshake itself. curl, on the other hand, produces a handshake Cloudflare is happy to wave through.

So the agent doesn't use urllib at all. Every request goes through curl as a subprocess:

# cluster.aitherium.com sits behind Cloudflare, which 403s Python's urllib even
# with a spoofed User-Agent (it fingerprints the TLS/JA3 of the client, not just
# the UA). curl gets through cleanly, so route HTTP through curl via subprocess.
def _post(path, payload, headers=None):
    args = ["curl", "-sk", "--max-time", "20", "-X", "POST", GATEWAY + path,
            "-H", "Content-Type: application/json", "-H", f"User-Agent: {_UA}"]
    ...

It's a little ugly to shell out for HTTP from Python. It's also the reason the stdlib-only constraint costs us nothing: we weren't going to install curl_cffi to defeat a fingerprint when the box already ships curl.
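A fuller sketch of the curl-via-subprocess pattern, under stated assumptions: build_curl_args and post_json are hypothetical names, and the return shape (status code, parsed JSON) is a guess at what the agent's _post returns. Note that the excerpt above passes -k, which disables TLS certificate verification; this sketch leaves it out, assuming the gateway presents a valid certificate.

```python
import json
import subprocess


def build_curl_args(url, payload, headers=None, timeout=20):
    """Return (argv, request_body) for a JSON POST through curl."""
    args = [
        "curl", "-sS", "--max-time", str(timeout), "-X", "POST", url,
        "-H", "Content-Type: application/json",
        "--data-binary", "@-",      # read the request body from stdin
        "-w", "\n%{http_code}",     # append the HTTP status on its own line
    ]
    for key, value in (headers or {}).items():
        args += ["-H", f"{key}: {value}"]
    return args, json.dumps(payload)


def post_json(url, payload, headers=None, timeout=20):
    """POST JSON via curl; return (status, parsed_body), with status 0 on failure."""
    args, body = build_curl_args(url, payload, headers, timeout)
    proc = subprocess.run(args, input=body, capture_output=True,
                          text=True, timeout=timeout + 5)
    if proc.returncode != 0:
        return 0, {}
    raw, _, code = proc.stdout.rpartition("\n")
    status = int(code) if code.isdigit() else 0
    try:
        data = json.loads(raw) if raw.strip() else {}
    except ValueError:
        data = {}
    return status, data
```

Sending the body on stdin keeps the payload, which may carry a token, out of the process argument list where other users on the box could read it.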

What "managed node" actually buys you

Registration isn't a one-time announcement — it's a living entry. Each heartbeat does three things on the gateway:

_registered_nodes[target]["last_seen"] = _dt.now(_tz.utc).isoformat()
_registered_nodes[target]["status"] = "online"
caps = body.get("capabilities")
if isinstance(caps, dict) and caps:
    _registered_nodes[target]["hardware"] = caps   # refresh live GPU readings

last_seen advances every 30s — that's the freshness signal. A node that stops heartbeating goes stale, and the fleet can reclaim or re-onboard it.
status is reasserted as online on every beat.
hardware is refreshed with the latest nvidia-smi snapshot, so the registry's view of VRAM and utilization is never more than 30 seconds old.
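The freshness check that last_seen enables could look something like the following. It is a sketch: the three-missed-beats threshold and the stale_nodes helper are assumptions, not the gateway's documented policy.

```python
from datetime import datetime, timedelta, timezone

HB_INTERVAL = 30  # seconds, matching the agent's default
# Assumption: treat a node as stale after three missed heartbeats.
STALE_AFTER = timedelta(seconds=3 * HB_INTERVAL)


def stale_nodes(registered_nodes, now=None):
    """Return ids of nodes whose last_seen is missing or older than STALE_AFTER."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for node_id, node in registered_nodes.items():
        last_seen = node.get("last_seen")
        if not last_seen:
            stale.append(node_id)
            continue
        # last_seen is written with .isoformat(), so fromisoformat round-trips it.
        if now - datetime.fromisoformat(last_seen) > STALE_AFTER:
            stale.append(node_id)
    return stale


now = datetime.now(timezone.utc)
nodes = {
    "fresh": {"last_seen": (now - timedelta(seconds=20)).isoformat()},
    "quiet": {"last_seen": (now - timedelta(minutes=5)).isoformat()},
}
print(stale_nodes(nodes, now))
```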

Registration also dedups by name, so an agent restart updates the existing entry instead of creating a ghost:

existing_id = next((nid for nid, n in _registered_nodes.items()
                    if n.get("name") == name), None)

And because the agent reports its instance_id, the rented box reconciles cleanly against the provider's fleet view — the same instance_id=39467543 that vast.ai knows it by is the one the mesh tracks.

Staying fresh — the part we're hardening now

The honest edge of this: that while True loop is the entire heartbeat, and right now it's a foreground process. As long as it runs, the node stays online. If the shell that launched it dies, the loop dies with it, and the node goes stale even though the box — and its training job — are perfectly healthy.

So the next step is making the heartbeat survive: detach it from the launching session, restart it if it exits, and re-launch it if the box reboots. The registry's last_seen field is exactly the instrument we verify against — watch it advance, kill the agent, watch it freeze, restart under a supervisor, watch it advance again. A managed node isn't proven by registering once; it's proven by staying fresh when nobody's watching.
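One way the restart-if-it-exits piece could be sketched in standard-library Python is below. The supervise function, its backoff policy, and its parameters are all assumptions for illustration. It does not cover reboots, and the supervisor itself still has to be launched detached (for example by an init system) to outlive the launching shell.

```python
import subprocess
import sys
import time


def supervise(cmd, max_launches=None, base_delay=1.0, max_delay=60.0,
              sleep=time.sleep):
    """Run cmd and relaunch it whenever it exits; return the number of launches."""
    launches = 0
    delay = base_delay
    while max_launches is None or launches < max_launches:
        started = time.monotonic()
        # start_new_session puts the child in its own session on POSIX, so a
        # hangup delivered to the launching terminal does not reach it.
        proc = subprocess.Popen(cmd, start_new_session=True)
        launches += 1
        proc.wait()
        if time.monotonic() - started >= max_delay:
            delay = base_delay  # it ran for a while, so reset the backoff
        sleep(delay)
        delay = min(delay * 2, max_delay)
    return launches


# Hypothetical usage, running the agent forever:
#   supervise([sys.executable, "aither_node_agent.py"])
```

The exponential backoff keeps a crash-looping agent from hammering the gateway, while the reset lets a long-lived agent restart quickly after an occasional exit.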

The bigger picture

The reason this matters beyond one rented H100: the registration handshake doesn't care where the box is. A vast.ai instance across the WAN, an edge box on a home LAN, a Dell tucked in a closet — they all speak the same two endpoints (/nodes/register, /nodes/{id}/heartbeat) with the same PSK handshake, and they all land in the same registry. For the truly hands-off case there's even a one-command public installer:

curl -fsSL https://cluster.aitherium.com/install.sh | sh -s -- --token <PSK>

Public by design — the security is the key, not the obscurity of the URL. One file, one key, and a machine anywhere becomes something the mesh can see, schedule, and trust.

Files

| File | Role |
| --- | --- |
| `AitherOS/services/mesh/providers/aither_node_agent.py` | The stdlib-only agent: PSK register + 30s `nvidia-smi` heartbeat, routed through `curl` |
| `AitherOS/services/core/AitherGateway.py` | `/nodes/register` (PSK + single-use enrollment token gate), `/nodes/{id}/heartbeat`, `/install.sh`, `/agent.py` |
| `AitherOS/services/mesh/providers/vast_ai.py` | vast.ai provisioning + the now-retired public-GitHub client pull it replaced |
| `AitherOS/services/mesh/AitherComet.py` | The deployment streaker that drives provider onboarding across the mesh |
