Cloud Upgrade Pipeline: One Command, Config to Running GPU
Published by Aitherium — March 25, 2026
We added FP8 KV cache support to our local vLLM stack last week. Context windows doubled. Memory usage dropped. TurboQuant was a clean win locally. Then we looked at our cloud GPUs on Vast.ai and realized: they were still running the old configs. max-model-len 8192. No kv-cache-dtype. The reasoning model that's supposed to handle effort-8 tasks was serving with half the context window and none of the optimizations.
The fix should have been simple: update the config, redeploy. But our ModelDeploymentManager actively blocked redeployment of running models (line 1197: "already active, teardown first"). There was no upgrade path. You had to manually tear down, wait, redeploy, re-register. For every instance.
So we built one.
The Upgrade Pipeline
upgrade_deployment() is atomic: snapshot the old session's identity (profile, served_name, model, backend), tear it down, reload the profile YAML to pick up any config changes, redeploy with fresh settings. One method call. The old instance is gone and the new one is provisioned with whatever cloud_node_profiles.yaml says today.
new_session = await mgr.upgrade_deployment(session_id, tenant_id)
upgrade_all_deployments() iterates over every running session sequentially (provisioning in parallel would invite GPU marketplace contention) and returns a summary with three buckets: upgraded, failed, skipped.
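To make the sequence concrete, here is a minimal sketch of the snapshot, teardown, redeploy flow and the sequential bulk loop. It is a toy stand-in, not the real ModelDeploymentManager: the SketchManager class, the Session fields beyond those named above, the in-memory profiles dict (standing in for cloud_node_profiles.yaml), the config keys, and the meaning of "skipped" (sessions excluded by the profile filter) are all assumptions, and tenant_id is omitted.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Session:
    # Identity fields named in the post, plus the config captured at deploy time.
    session_id: str
    profile: str
    served_name: str
    model: str
    backend: str
    config: dict


class SketchManager:
    """Hypothetical stand-in for the deployment manager; provisions nothing real."""

    def __init__(self, profiles):
        self.profiles = profiles  # stands in for cloud_node_profiles.yaml
        self.sessions = {}
        self._next_id = 0

    async def deploy(self, profile, served_name, model, backend):
        self._next_id += 1
        # Copy the profile as it is right now, so redeploys pick up edits.
        session = Session(f"s{self._next_id}", profile, served_name, model,
                          backend, dict(self.profiles[profile]))
        self.sessions[session.session_id] = session
        return session

    async def teardown(self, session_id):
        del self.sessions[session_id]

    async def upgrade_deployment(self, session_id):
        old = self.sessions[session_id]
        # Snapshot identity before the old session disappears.
        identity = (old.profile, old.served_name, old.model, old.backend)
        await self.teardown(session_id)
        return await self.deploy(*identity)

    async def upgrade_all_deployments(self, profile=None):
        summary = {"upgraded": [], "failed": [], "skipped": []}
        # One session at a time: no concurrent provisioning.
        for session_id, session in list(self.sessions.items()):
            if profile is not None and session.profile != profile:
                summary["skipped"].append(session_id)
                continue
            try:
                new = await self.upgrade_deployment(session_id)
                summary["upgraded"].append(new.session_id)
            except Exception:
                summary["failed"].append(session_id)
        return summary


async def demo():
    profiles = {"reasoning": {"max_model_len": 8192}}
    mgr = SketchManager(profiles)
    await mgr.deploy("reasoning", "reasoning", "deepseek-r1:14b", "vllm")
    # Simulate editing the profile after the instance is already running.
    profiles["reasoning"] = {"max_model_len": 32768, "kv_cache_dtype": "fp8"}
    summary = await mgr.upgrade_all_deployments()
    return summary, mgr.sessions


summary, sessions = asyncio.run(demo())
print(summary)
```

Note that in this sketch a failed redeploy leaves the old session already torn down, which is one reason a per-session failure list in the summary is useful.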
Two new Genesis endpoints expose this:
POST /deploy/cloud-model/upgrade/{session_id} (single instance)
POST /deploy/cloud-model/upgrade-all (bulk, with optional profile filter)
And a PowerShell script for the automation layer:
.\3073_Upgrade-CloudModels.ps1 -All -WhatIf # Preview changes
.\3073_Upgrade-CloudModels.ps1 -All # Upgrade everything
The -WhatIf flag diffs running config against profile YAML without touching anything. You see exactly what would change before committing.
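The preview boils down to a key-by-key comparison of two config mappings. The sketch below shows that idea in Python rather than PowerShell; the function name diff_config and the config keys are hypothetical, and the real script may compare different fields.

```python
def diff_config(running: dict, desired: dict) -> dict:
    """Return {key: (running_value, desired_value)} for every key that differs.

    A key absent on one side is reported with None on that side.
    """
    missing = object()  # sentinel so an explicit None value is not confused with "absent"
    changes = {}
    for key in sorted(set(running) | set(desired)):
        old = running.get(key, missing)
        new = desired.get(key, missing)
        if old != new:
            changes[key] = (
                None if old is missing else old,
                None if new is missing else new,
            )
    return changes


# Hypothetical keys: what the instance runs today vs. what the profile says.
running = {"max_model_len": 8192}
desired = {"max_model_len": 32768, "kv_cache_dtype": "fp8"}

for key, (old, new) in diff_config(running, desired).items():
    print(f"{key}: {old!r} -> {new!r}")
```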
Five Bugs Between Config and Request
Building the pipeline was the easy part. Getting a request to actually reach the cloud reasoning model exposed five bugs in the path between "config says use deepseek-r1" and "user gets a response from deepseek-r1."
Bug 1: GPU name format mismatch
Our cloud_node_profiles.yaml uses underscored GPU names: RTX_3090. Vast.ai's search API uses spaces: RTX 3090. The search query gpu_name = RTX 3090 broke the CLI parser because the space was treated as a token delimiter. Every marketplace search returned zero results.
Fix: removed the GPU name filter entirely. VRAM + compute capability filtering is sufficient. The name was just cosmetic preference anyway.
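A sketch of what name-free selection can look like once offers have been fetched: filter purely on memory and compute capability. The field names gpu_ram (in MB) and compute_cap (e.g. 860 for 8.6), the thresholds, and the sample offers are assumptions for illustration, not a transcript of the real search query.

```python
def eligible_offers(offers, min_vram_mb, min_compute_cap):
    """Keep offers that meet the VRAM and compute-capability floors.

    The GPU name is deliberately ignored, so "RTX_3090" vs "RTX 3090"
    spelling differences cannot affect the result.
    """
    return [
        offer for offer in offers
        if offer.get("gpu_ram", 0) >= min_vram_mb
        and offer.get("compute_cap", 0) >= min_compute_cap
    ]


# Hypothetical marketplace results.
offers = [
    {"id": 1, "gpu_name": "RTX 3090", "gpu_ram": 24576, "compute_cap": 860},
    {"id": 2, "gpu_name": "RTX 2080 Ti", "gpu_ram": 11264, "compute_cap": 750},
    {"id": 3, "gpu_name": "RTX 4090", "gpu_ram": 24564, "compute_cap": 890},
]

picked = eligible_offers(offers, min_vram_mb=24000, min_compute_cap=860)
print([offer["id"] for offer in picked])
```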
Bug 2: Port format assumption
The Vast.ai provider's get_instance() parsed data["ports"] as a list of dicts with PrivatePort/PublicPort keys. But the actual API returns a dict keyed by port spec (for example "22/tcp"), each mapping to a nested array of bindings. Iterating a dict yields its string keys, not dicts, so the parser failed with: 'str' object has no attribute 'get'.
This crashed wait_for_ready(), which meant instances provisioned successfully but were never detected as ready.
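A minimal sketch of parsing the dict-shaped port map instead of iterating it as a list. The inner binding keys HostIp and HostPort follow the Docker convention and are an assumption here, as is the function name parse_port_map; check them against an actual API response before relying on this.

```python
def parse_port_map(ports):
    """Map each container port spec (e.g. "8000/tcp") to its first public port.

    Tolerates a missing map, empty binding lists, and bindings without a port,
    so a half-provisioned instance yields {} instead of raising.
    """
    result = {}
    for spec, bindings in (ports or {}).items():
        if not bindings:
            continue
        host_port = bindings[0].get("HostPort")  # assumed Docker-style key
        if host_port is not None:
            result[spec] = int(host_port)
    return result


# Hypothetical response fragment in the dict-keyed shape described above.
ports = {
    "22/tcp": [{"HostIp": "0.0.0.0", "HostPort": "12160"}],
    "8000/tcp": [],
}
print(parse_port_map(ports))
```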
Bug 3: offer_id not passed in async deploy
The async deploy endpoint (POST /deploy/cloud-model) fires a background task that calls deploy_model() — but the background task didn't forward the offer_id parameter. So even when you specified exactly which GPU to rent, the async path ignored it and ran a marketplace search instead.
Bug 4: EffortScaler hardcoded all tiers to orchestrator
The _TIER_MODEL_MAP in EffortScaler mapped every tier — including complex and critical — to aither-orchestrator. The comment said "DeepSeek R1 cannot do structured tool calls." True for tool-calling, but for pure reasoning tasks at effort 7+, the reasoning model is the right choice. The orchestrator can invoke it as a tool when it needs tool calling alongside reasoning.
Bug 5: Agentic path bypassed model selection
This was the sneakiest one. When effort hits 8+, the chat engine promotes the request to agentic mode (AgentRuntime with ReAct loops). The agentic dispatch passed request.model or DEFAULT_MODEL to UCB — but request.model was always None for auto-routed requests, so DEFAULT_MODEL (orchestrator) won every time. The EffortScaler's carefully computed plan.recommended_model = "deepseek-r1:14b" was ignored.
The fix: read the model from the ExecutionPlan before falling back to DEFAULT_MODEL.
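The resolution order can be sketched as a small helper. The ExecutionPlan fields match the names used above, but the dataclass itself, the resolve_model function, and the choice to let an explicit request.model win over the plan are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_MODEL = "aither-orchestrator"


@dataclass
class ExecutionPlan:
    tier: str
    effort: int
    recommended_model: Optional[str] = None


def resolve_model(request_model: Optional[str],
                  plan: Optional[ExecutionPlan],
                  default: str = DEFAULT_MODEL) -> str:
    """Pick the model for agentic dispatch.

    Order (assumed): explicit request model, then the plan's recommendation,
    then the default. The bug was skipping the middle step.
    """
    if request_model:
        return request_model
    if plan is not None and plan.recommended_model:
        return plan.recommended_model
    return default


plan = ExecutionPlan(tier="complex", effort=8, recommended_model="deepseek-r1:14b")
print(resolve_model(None, plan))  # auto-routed request: the plan decides
print(resolve_model(None, None))  # no plan at all: fall back to the default
```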
The _caller_can_agentic Crash
While testing, we hit an UnboundLocalError: cannot access local variable '_caller_can_agentic'. The agentic upgrade check at line 574 uses _caller_can_agentic, but the variable was only defined 370 lines later in the caller security extraction block. When min_effort=8 forced the agentic upgrade path, it ran before the variable existed.
This is the kind of bug that never appears in normal operation — IntentEngine usually classifies effort below 8 for simple math questions, so the agentic path rarely fires from organic traffic. It only surfaces when you explicitly set min_effort.
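The failure mode is easy to reproduce in isolation. In the sketch below (function names and the effort arithmetic are hypothetical), the variable is assigned only after the branch that reads it, so Python treats it as a local that has no value yet on the high-effort path.

```python
def handle_buggy(effort: int, min_effort: int = 0) -> str:
    if max(effort, min_effort) >= 8:
        # Read happens here, before the assignment further down has run.
        if _caller_can_agentic:
            return "agentic"
    # ... imagine several hundred lines of unrelated logic ...
    _caller_can_agentic = True
    return "chat"


def handle_fixed(effort: int, min_effort: int = 0) -> str:
    # Fix: establish the variable before any branch can read it.
    _caller_can_agentic = True
    if max(effort, min_effort) >= 8 and _caller_can_agentic:
        return "agentic"
    return "chat"


print(handle_buggy(effort=3))  # low effort never touches the early read

try:
    handle_buggy(effort=3, min_effort=8)
except UnboundLocalError as exc:
    print("crash:", exc)

print(handle_fixed(effort=3, min_effort=8))
```

Because the assignment exists somewhere in the function body, the name is local everywhere in it, which is why the early read raises instead of falling back to a global.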
SSH Tunnels: The Last Mile
Even after all five bugs were fixed, the cloud vLLM servers were unreachable. Both Vast.ai instances were SSH-only — no direct port mapping. The vLLM server listens on port 8000 inside the container, but there's no public HTTP endpoint.
The solution: SSH tunnels from the host to the cloud instances, binding to 0.0.0.0 so Docker containers can reach them via host.docker.internal:
ssh -N -f -L 0.0.0.0:8201:localhost:8000 root@ssh3.vast.ai -p 12160
Then register the backend with LLMQueue via the new POST /deploy/cloud-model/register-backend endpoint:
{
  "name": "vllm_cloud_reasoning",
  "base_url": "http://host.docker.internal:8201",
  "max_concurrent": 16
}
We added this endpoint because the deployment pipeline sets up SSH tunnels and backend registration automatically only during provisioning; an instance that was restarted by hand has to be registered by hand.
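For scripting the manual case, the registration call can be assembled with the Python standard library. This sketch only builds the request and does not send it; the Genesis base URL (http://genesis.example) is a placeholder, and the helper name build_register_request is hypothetical.

```python
import json
import urllib.request


def build_register_request(genesis_url: str, name: str, base_url: str,
                           max_concurrent: int) -> urllib.request.Request:
    """Build (but do not send) the POST for the register-backend endpoint."""
    payload = {
        "name": name,
        "base_url": base_url,
        "max_concurrent": max_concurrent,
    }
    return urllib.request.Request(
        url=genesis_url.rstrip("/") + "/deploy/cloud-model/register-backend",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_register_request(
    "http://genesis.example",  # placeholder; substitute the real Genesis address
    name="vllm_cloud_reasoning",
    base_url="http://host.docker.internal:8201",
    max_concurrent=16,
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=10)
```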
The Result
Before: effort 8 requests always hit the local 8B orchestrator. The reasoning model existed on Vast.ai but was stopped, running old configs, and unreachable.
After: effort 7+ routes to deepseek-r1:14b on a cloud RTX 3090 with FP8 KV cache, 32K context, and 16 concurrent sequences. The orchestrator handles tool calling and quick responses. The reasoning model handles math, analysis, architecture, and anything that needs chain-of-thought depth.
[PLAN] Applied ExecutionPlan: tier=complex, model=deepseek-r1:14b, effort=8
Config changes now flow from cloud_node_profiles.yaml to running instances with one command. No manual teardown. No re-registration. No five-bug scavenger hunt.
Next time, anyway.