
Reasoning as a Tool: Why We Offloaded Deep Thinking to a Remote GPU

March 18, 2026 · 7 min read · Aitherium

We run multiple GPU workloads on one card: the orchestrator model (Nemotron 8B, always loaded), a reasoning model (DeepSeek-R1:14b, loaded on demand via VLLMSwap), a vision model (Qwen2.5-VL, loaded for image analysis), and FLUX for image generation. MicroScheduler coordinates all of this — evicting reasoning when FLUX needs to generate, swapping the reasoning model in when a deep thinking request arrives, managing priority queues so nothing starves. For a single user, it works beautifully. VLLMSwap is fast, the priority system is smart, and the experience feels seamless.

The problem is that AitherOS doesn't serve a single user anymore.

The Multi-User Problem

When one person is using the platform, VRAM swapping is invisible. You ask a question, the orchestrator answers instantly. You ask something hard, VLLMSwap loads the reasoning model, thinks for 30 seconds, gives you a thorough answer. You ask for an image, FLUX loads, generates, done. Sequential requests, one model swap at a time.

Now put three users on the system simultaneously:

  • User A asks a casual question → orchestrator handles it, instant
  • User B asks "analyze this race condition in my lock-free queue" → VLLMSwap starts loading the reasoning model into VRAM
  • User C pastes a screenshot → vision model needs to load, but VRAM is busy with the reasoning swap
  • User A asks a follow-up → orchestrator is still loaded, responds... but now User B's reasoning swap completes, and User C is still waiting
  • User C's vision load starts → but User B's reasoning model is occupying the VRAM slot
  • User A asks for an image → FLUX needs VRAM, but reasoning is still loaded from User B's request

On a 32GB GPU (RTX 5090), the math looks like this:

Workload                           VRAM     When
Windows desktop + CUDA contexts    ~3 GB    Always
Orchestrator (Nemotron 8B AWQ)     ~5 GB    Always loaded — never unloaded
Reasoning (DeepSeek-R1:14b AWQ)    ~10 GB   Effort 7+ tasks
Vision (Qwen2.5-VL-7B)             ~8 GB    Image analysis
FLUX fp8 (image generation)        ~16 GB   On demand

The orchestrator is always loaded — it handles every single message and must respond instantly. It never gets evicted, never sleeps. That's non-negotiable.

But with the orchestrator permanently occupying its slot, you have ~24 GB left for reasoning, vision, and FLUX. You can fit one at a time; two at once doesn't work in practice, because KV cache and activation memory pile on top of the raw weights. And with multiple users issuing different kinds of requests, you get a queue of model swaps: load reasoning, unload reasoning, load vision, unload vision, load FLUX, unload FLUX. Each swap takes 15-30 seconds. User C's screenshot analysis is waiting behind User B's reasoning swap, which is waiting behind the previous FLUX eviction.

The single-user experience was excellent. The multi-user experience was a traffic jam.

Why Reasoning Is the Right One to Offload

Not all workloads are equal:

  • Orchestrator: Must be instant. Every single chat message goes through it. Always loaded, never evicted. Non-negotiable.
  • Vision: Fast turnaround expected. Users paste a screenshot and want analysis in 2-3 seconds, not 30.
  • Image generation: Users already expect 10-30 seconds. A network round-trip doesn't change the perceived experience.
  • Reasoning: Deep thinking tasks already take 10-60 seconds. Users asking "analyze this race condition" or "think through this architecture" expect to wait. Adding 100ms of network latency to a 30-second reasoning chain is invisible.

Reasoning is also the least frequent workload. Most chat is effort 1-6 (orchestrator handles it directly). Only effort 7+ tasks trigger deep reasoning, and those are maybe 5-10% of all requests.

But here's the key insight for multi-user: reasoning is the workload that blocks everyone else the longest. A reasoning swap takes 15-30 seconds to load, then the model runs for another 10-60 seconds, then it needs to be evicted before vision or FLUX can load. That's potentially 90 seconds where User B's reasoning request is monopolizing the VRAM swap slot, and Users A and C are waiting for vision/FLUX access.
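The arithmetic behind that traffic jam is simple enough to sketch. The numbers below are the upper bounds of the ranges above; the vision swap-in time and the decision to fold eviction cost into the next load are assumptions, not measurements:

```python
# Worst-case timings, using the upper bound of each range quoted above.
REASONING_LOAD_S = 30  # VLLMSwap load: 15-30 s
REASONING_RUN_S = 60   # deep thinking: 10-60 s
VISION_LOAD_S = 30     # vision swap-in; assumed same 15-30 s range

def user_c_wait(reasoning_offloaded: bool) -> int:
    """Seconds before User C's screenshot analysis can even start."""
    if reasoning_offloaded:
        # Reasoning never touches local VRAM: only vision's own swap-in.
        return VISION_LOAD_S
    # Reasoning must load, run, and be evicted before vision can load in.
    return REASONING_LOAD_S + REASONING_RUN_S + VISION_LOAD_S

print(user_c_wait(False), user_c_wait(True))  # 120 30
```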

Offloading reasoning removes the longest-blocking, least-frequent workload from the swap queue entirely. Now vision and FLUX only contend with each other — a two-way negotiation instead of a three-way traffic jam.

The Insight: Reasoning Is Just Another Tool

Our chat pipeline already supports tool calling. The orchestrator can decide to generate an image, analyze a screenshot, or search the web. It does this through the standard OpenAI function-calling interface — the model returns a tool_calls response, we execute the tools, fold the results back in, and the model continues with the new information.

Reasoning fits this pattern perfectly. The orchestrator doesn't need to be the reasoning model. It needs to be able to call a reasoning model when the problem requires it. Just like it calls FLUX when it needs an image.
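The loop underneath this is the standard function-calling pattern. A minimal sketch, not the AitherOS pipeline itself: `chat` stands in for any OpenAI-style chat-completions call, and `execute_tool` for the dispatcher that runs `reason`, `generate_image`, and friends:

```python
import json

def run_with_tools(chat, messages, tools, execute_tool):
    """Generic function-calling loop: let the model request tools,
    execute them, fold the results back into the transcript as `tool`
    messages, and continue until the model answers in plain text."""
    while True:
        msg = chat(messages=messages, tools=tools)
        if not msg.get("tool_calls"):
            return msg["content"]  # final answer, no tools requested
        messages.append(msg)
        for call in msg["tool_calls"]:
            result = execute_tool(call["function"]["name"],
                                  json.loads(call["function"]["arguments"]))
            messages.append({"role": "tool",
                             "tool_call_id": call["id"],
                             "content": result})
```

From the orchestrator's point of view, `reason` is indistinguishable from `generate_image`: it requests the call, gets a result back, and keeps going.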

User B: "Analyze the race condition in this lock-free queue implementation"
  |
IntentEngine: effort=8, needs_tools=true
  |
EffortScaler: deep_reasoning=true, enable_reasoning_tool=true
  |
UCB.think() → offers tools: [generate_image, analyze_image, web_search, reason]
  |
Orchestrator (local, always loaded):
  "This needs deep analysis. Let me use the reason tool."
  → tool_call: reason(problem="analyze race condition in...", context="<code>")
  |
_execute_reason():
  → POST to AITHER_REASONING_ENDPOINT (rented GPU or cloud API)
  → Returns: detailed chain-of-thought analysis
  |
Meanwhile: User A gets instant chat responses, User C's vision request loads without waiting
  |
Orchestrator: folds reasoning result back in, synthesizes final answer
  → "Based on the analysis, the race condition occurs when..."

The orchestrator stays in control. It decides when to reason deeply. The reasoning model is a specialist it can consult over the network — not a local process competing for VRAM and blocking other users' requests.

What Changes for Multi-User

Before (single-user optimized)                        After (multi-user ready)
Reasoning swap blocks VRAM for 15-90s                 Reasoning is a network call — zero VRAM impact
Vision and FLUX queue behind reasoning swaps          Vision and FLUX only contend with each other
3-way VRAM negotiation (reasoning + vision + FLUX)    2-way negotiation (vision + FLUX)
User B's deep thinking blocks User C's screenshot     All users served concurrently
Orchestrator conservatively allocated to leave room   Orchestrator gets more headroom for concurrent requests

The orchestrator can now handle more concurrent sequences because it's not sharing VRAM pressure with a reasoning model that might load at any moment. More KV cache means longer conversations, bigger code blocks, and richer system prompts — for every user simultaneously.

Implementation

The reason tool lives alongside the existing conversational tools in the chat pipeline:

REASON_TOOL = {
    "type": "function",
    "function": {
        "name": "reason",
        "description": "Perform deep chain-of-thought reasoning about a complex problem...",
        "parameters": {
            "type": "object",
            "properties": {
                "problem": {"type": "string"},
                "context": {"type": "string"},
            },
            "required": ["problem"],
        },
    },
}
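Offering the tool is then just conditionally appending that spec to the tool list. A small illustrative sketch — `select_tools` and `base_tools` are made-up names, and the `enable_reasoning_tool` flag is the plan field described later:

```python
def select_tools(plan: dict, base_tools: list, reason_tool: dict) -> list:
    """Always offer the base conversational tools; offer `reason`
    only when the execution plan has enabled it."""
    tools = list(base_tools)
    if plan.get("enable_reasoning_tool"):
        tools.append(reason_tool)
    return tools
```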

The executor has a three-tier fallback:

  1. Remote vLLM endpoint — A rented GPU (Vast.ai, RunPod, etc.) running DeepSeek-R1. Set AITHER_REASONING_ENDPOINT and you're done. Dedicated hardware, no local VRAM impact, no blocking other users.

  2. Cloud provider — If no remote endpoint is configured, the CloudProviderRouter picks the best available reasoning-tier model (Gemini Pro, Claude, etc.) based on configured API keys.

  3. Local MicroScheduler — Last resort. If the reasoning model happens to be loaded locally, MicroScheduler routes to it. This is the "beefy hardware" path for machines with 48GB+ VRAM where swapping isn't an issue even with multiple users.
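A hedged sketch of that fallback chain, using only the standard library. `AITHER_REASONING_ENDPOINT` is the article's variable; the model name and payload shape assume an OpenAI-compatible chat-completions route, and `cloud_router` / `local_scheduler` are stand-in callables for CloudProviderRouter and MicroScheduler, whose real APIs aren't shown here:

```python
import json
import os
import urllib.request

def execute_reason(problem: str, context: str = "",
                   cloud_router=None, local_scheduler=None) -> str:
    """Three-tier fallback: remote vLLM endpoint, then cloud provider,
    then the locally scheduled reasoning model."""
    prompt = f"{problem}\n\n{context}".strip()

    # 1. Remote vLLM endpoint (rented GPU), if configured. Assumed to
    #    expose an OpenAI-compatible chat-completions route.
    endpoint = os.environ.get("AITHER_REASONING_ENDPOINT")
    if endpoint:
        body = json.dumps({
            "model": "deepseek-r1",  # assumed model name on the endpoint
            "messages": [{"role": "user", "content": prompt}],
        }).encode()
        req = urllib.request.Request(
            endpoint, data=body,
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req, timeout=120) as resp:
            data = json.load(resp)
        return data["choices"][0]["message"]["content"]

    # 2. Cloud provider, if any API keys are configured.
    if cloud_router is not None:
        return cloud_router(prompt)

    # 3. Local MicroScheduler as a last resort (beefy-hardware path).
    if local_scheduler is not None:
        return local_scheduler(prompt)

    raise RuntimeError("no reasoning backend available")
```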

When Does the Tool Get Offered?

Two triggers:

Signal-based: The message contains reasoning patterns — "analyze", "edge case", "race condition", "prove", "step by step", "trade-off". Same detection approach as image generation signals.

Plan-based: The EffortScaler sets deep_reasoning=true and enable_reasoning_tool=true for reasoning intents at effort >= 7. This flows through the ExecutionPlan into the tool selection logic.

The model decides whether to actually call the tool. We offer it; the orchestrator's own judgment determines whether the problem warrants deep reasoning or whether it can handle it directly. Most of the time, it handles it. For the hard stuff, it phones a friend.
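The two triggers combine into one offer decision. Signal detection can be as simple as a keyword scan — the pattern below echoes the article's example signals rather than the production set, and the plan fields mirror the flags named above:

```python
import re

# Illustrative reasoning signals (the examples above, not the full set).
REASONING_SIGNALS = re.compile(
    r"\b(analyz\w*|edge case|race condition|prove|step by step|trade-?off)\b",
    re.IGNORECASE,
)

def should_offer_reason_tool(message: str, plan: dict) -> bool:
    """Offer the tool on either trigger; the orchestrator's own
    judgment still decides whether to actually call it."""
    signal_based = bool(REASONING_SIGNALS.search(message))
    plan_based = bool(plan.get("enable_reasoning_tool"))
    return signal_based or plan_based
```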

The Bigger Picture

This is the pattern we keep landing on: the orchestrator is a router, not a monolith. It doesn't need to be the smartest model in the room. It needs to know who to ask.

  • Need an image? Call FLUX.
  • Need to see a screenshot? Call vision.
  • Need current information? Call web search.
  • Need to think deeply? Call reasoning.

Each specialist runs where it makes sense — FLUX on the local GPU (latency-sensitive, VRAM-hungry), reasoning on a rented cloud GPU (latency-tolerant, infrequent), vision wherever there's capacity. The orchestrator stays small, fast, always loaded, and makes routing decisions.

The old VLLMSwap system was a remarkable piece of engineering — it squeezed four workloads onto one GPU and made it feel seamless for a single user. But when you go from one user to many, the problem changes from "can we fit everything?" to "can we serve everyone without blocking?" Reasoning-as-a-tool is that answer: an 8B orchestrator that can consult a 70B reasoning model over the network serves more users, with less latency, than a local 14B model competing for VRAM with every other workload on the card.

The system got smarter by getting smaller — and faster for everyone by offloading the one thing that nobody needs to be fast.
