
Returning to Nemotron-Elastic with Some New Tricks

March 20, 2026 · 8 min read · David Parkhurst

Aether (also spelled ether or aither) is the "material that fills the region of the universe above the terrestrial sphere" — or at least it was according to ancient science. Aether was the Fifth Element, quintessence, and the concept was used in theories explaining everything from gravity to light. The origins trace back to ancient Greece, where aether was thought to be the "pure essence" that the gods breathed — a heavenly form of air created only in the heavens. Mortals breathe oxygen. Gods breathe aether.

We named our system after it. Not because we think we're building something divine — but because the original concept of aether was about the invisible medium that connects everything. The substrate through which all forces propagate. That's what an AI operating system is: the invisible layer that connects models, tools, memory, and reasoning into something that feels like a single mind.

Anyway, here is my private AI hybrid cloud stack, or whatever you might call it.

The Nemotron-Elastic Experiment (Take Two)

We first tried NVIDIA's Nemotron-Elastic back in January. The pitch: one 12B checkpoint, three model sizes (6B/9B/12B) via zero-shot activation slicing. Download once, serve three. The hybrid Mamba2/Transformer architecture means fast inference with subquadratic attention costs. We sliced the variants, loaded the 6B alongside our orchestrator, and watched it work beautifully for about ten minutes before the VRAM ran out.

Variant                Reasoning Score   VRAM (BF16)   Our Use Case
nemotron-elastic:6b    70.61             ~6 GB         Reflex tier — neurons, background tasks
nemotron-elastic:9b    75.95             ~9 GB         Agent tier — multi-step tool usage
nemotron-elastic:12b   77.41             ~12 GB        Reasoning tier — complex analysis

The problem wasn't the model. The problem was trying to fit it on the same GPU as everything else.

What Lives on the GPU

Our RTX 5090 has 32 GB of VRAM. Here's what's already there:

  • Nemotron-Orchestrator-8B (always-on, ~13 GB with KV cache): The brain. Handles every user message, every tool call, every routing decision. This model never sleeps.
  • Nomic embeddings (~1 GB): Always-on for semantic search.
  • Vision model (Qwen2.5-VL, ~8 GB): Swaps in on demand for image analysis.
  • ComfyUI/FLUX (~12 GB): Swaps in for image generation.
  • Windows desktop: ~3 GB because apparently Windows needs GPU memory to render a taskbar.

The orchestrator at 13 GB plus embeddings at 1 GB plus Windows at 3 GB = 17 GB committed. That leaves 15 GB free, which sounds like plenty until you realize vision needs 8 GB and FLUX needs 12 GB and they swap with each other through MicroScheduler's VRAM coordination system.
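The budget arithmetic above is simple enough to script. A minimal sketch, using the numbers from this post (the component names are just labels, not real identifiers in our stack):

```python
# VRAM budget for the RTX 5090, numbers taken from this post (GB).
TOTAL_VRAM = 32

always_on = {"orchestrator": 13, "embeddings": 1, "windows_desktop": 3}
on_demand = {"vision_qwen2.5_vl": 8, "comfyui_flux": 12}

committed = sum(always_on.values())   # 17 GB that never moves
headroom = TOTAL_VRAM - committed     # 15 GB left for swappable models

# Vision and FLUX each fit in the headroom alone, but not together
# (8 + 12 = 20 > 15), which is why they swap instead of coexisting.
fits_together = sum(on_demand.values()) <= headroom

print(committed, headroom, fits_together)  # 17 15 False
```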

There's no room for elastic. Not the 12B (24 GB BF16), not the quantized 12B (8 GB but only 0.77 GB left for KV cache — literally unusable), not even the sliced 6B alongside everything else. Every configuration either starved the model of KV cache or required sleeping the orchestrator, which we absolutely cannot do because it handles all user-facing chat.

The Answer Was Sitting in the Next DIMM Slot

128 GB of DDR5 RAM. Just sitting there. Running Docker containers and browser tabs.

Ollama already had nemotron-elastic:12b pulled from January (7.5 GB in Q4_K_M quantization). All three variants — 6b, 9b, 12b — were already cached. Ollama was already running as a service. The model was already registered in our tool system.

We'd been fighting VRAM budgets for hours when the answer was OLLAMA_NUM_GPU=0.

CPU-only inference. Zero VRAM consumed. The orchestrator calls elastic as a tool through the exact same invoke_model function it uses for everything else. The user never knows the difference.
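Besides the environment variable, Ollama also accepts this per request: the `num_gpu` option sets how many layers are offloaded to the GPU, and `0` keeps every layer in system RAM. A sketch of what the tool backend's request body might look like (the prompt is made up; the endpoint is Ollama's standard `/api/generate`):

```python
import json

# Per-request CPU pinning via Ollama's "num_gpu" option (0 = no layers
# offloaded to the GPU), as an alternative to the environment variable.
payload = {
    "model": "nemotron-elastic:12b",
    "prompt": "Summarize the trade-offs of CPU-only inference.",
    "stream": False,
    "options": {"num_gpu": 0},  # zero VRAM: all layers stay in system RAM
}

# In the real stack this body is POSTed to localhost:11434/api/generate;
# shown here only for the payload shape.
body = json.dumps(payload)
```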

The Stack

GPU (RTX 5090, 32 GB)
  Orchestrator   port 8120   always-on    ~13 GB
  Embeddings     port 8209   always-on    ~1 GB
  Vision         on-demand                ~8 GB
  ComfyUI/FLUX   on-demand                ~12 GB

CPU (Ryzen 9 9950X3D, 128 GB DDR5)
  Nemotron-Elastic-12B   Ollama   tool backend   ~7.5 GB RAM
  (6B/9B/12B nested in one checkpoint)

Cloud (Vast.ai, 2x RTX 3090)
  DeepSeek-R1:14b        reasoning        max quality
  MiroThinker-1.7-mini   research         deep research

The orchestrator is always the one talking to the user. When it needs deeper reasoning:

  • Effort 7-8: Calls invoke_model(preset="elastic") — Nemotron-Elastic on CPU
  • Effort 9-10: Calls deep_reasoning() — DeepSeek-R1 on cloud GPU

Both are tool calls. The orchestrator formulates a specific question, invokes the model, gets text back, incorporates it into its response. Same pattern for both, different backends.
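The effort-based dispatch can be sketched as a small routing function. This is a hypothetical illustration: `invoke_model` and `deep_reasoning` are the tool names from this post, but the thresholds-to-backend mapping shown is assumed from the bullets above, not our actual code:

```python
# Hypothetical sketch of the effort-based routing described above.
# "invoke_model" / "deep_reasoning" stand in for the stack's real tools.

def route_reasoning(effort: int) -> str:
    """Map an effort score (1-10) to a reasoning backend."""
    if effort >= 9:
        return "deep_reasoning"        # DeepSeek-R1 on cloud GPU
    if effort >= 7:
        return "invoke_model:elastic"  # Nemotron-Elastic on CPU
    return "orchestrator"              # answer inline, no tool call
```

Either way the orchestrator gets text back and folds it into its own streamed response.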

Swappable Stacks (Because We Kept Breaking Things)

After the third time we accidentally started the wrong container and crashed the orchestrator's VRAM, we built a proper configuration system. config/model-stacks.yaml defines four named stacks:

cloud-offload (current default): Orchestrator on GPU, everything else on cloud. Simplest config. Maximum VRAM headroom.

elastic-hybrid (the one this post is about): Orchestrator on GPU + Nemotron-Elastic on CPU via Ollama + cloud for max-quality reasoning.

ollama-hybrid: Ollama handles both reflex (llama3.2:3b on CPU) and reasoning (elastic:12b on CPU). GPU only runs the orchestrator. Good when GPU is saturated.

ollama-only: Everything on CPU. GPU free for training, ComfyUI, or whatever. Cloud handles the hard stuff.
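The post doesn't show config/model-stacks.yaml itself, but a plausible shape for two of the four stacks looks something like this — every key name below is an assumption, not the real schema:

```yaml
# Hypothetical sketch of config/model-stacks.yaml; the actual schema
# is not shown in this post, so all key names here are assumptions.
stacks:
  cloud-offload:
    gpu: [orchestrator]
    cloud: [deepseek-r1, mirothinker]
  elastic-hybrid:
    gpu: [orchestrator]
    cpu:
      ollama: [nemotron-elastic:12b]
    cloud: [deepseek-r1]
default: cloud-offload
```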

Switching is one command:

curl -X POST localhost:8001/model-stacks/switch \
  -d '{"stack": "elastic-hybrid"}'

The ModelStackManager handles everything automatically: stops containers that aren't needed, pulls Ollama models if they're missing, hot-reloads the IntentEngine routing (zero downtime), and persists the selection. We also wrote PowerShell automation scripts because AitherZero runs our ops:

.\5021_Switch-ModelStack.ps1 -List
.\5021_Switch-ModelStack.ps1 -Stack elastic-hybrid
.\5021_Switch-ModelStack.ps1 -Status

Performance

Backend                        Tokens/sec    First token     VRAM                  KV Cache          Use case
Orchestrator (GPU, vLLM)       ~50 tok/s     <100ms          14.6 GB (0.45 util)   59K tokens        All chat, tools, routing
Elastic 12B (CPU, Ollama)      ~8-12 tok/s   ~500ms          0 GB                  unlimited (RAM)   Local reasoning tool
DeepSeek-R1 (Cloud, Vast.ai)   ~40 tok/s     ~200ms + RTT    0 GB                  cloud-managed     Maximum quality reasoning

With elastic off the GPU, we bumped the orchestrator's memory utilization from 0.40 to 0.45 — a 47% increase in KV cache (59K tokens vs. ~40K) and 1.44x the concurrent conversations at full 40K context. That headroom came directly from not wasting VRAM on a model that runs perfectly fine on DDR5.

What does 59K tokens of KV cache mean in practice? It depends on conversation length:

  • 1-2K tokens (typical chat): 24 concurrent sessions (vLLM batch cap)
  • 4K tokens (detailed Q&A): ~14 concurrent sessions
  • 8K tokens (code analysis): ~7 concurrent sessions
  • 40K tokens (full context): 1.44 concurrent sessions

For a platform serving multiple demo users simultaneously, 24 concurrent short conversations is the difference between "works great" and "queue depth: 12, try again later." At 0.40 we were closer to 16 concurrent short sessions. That's a meaningful improvement from five lines of config.
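The session math above is just floor division against the KV budget, capped by vLLM's batch limit. A sketch with this post's numbers (note the 40K case floors to 1 whole session; the 1.44 figure earlier is the fractional ratio):

```python
KV_BUDGET = 59_000  # tokens of KV cache at 0.45 utilization
BATCH_CAP = 24      # vLLM's concurrent-sequence limit

def concurrent_sessions(tokens_per_conversation: int) -> int:
    """Whole conversations of a given length that fit in the KV budget."""
    return min(BATCH_CAP, KV_BUDGET // tokens_per_conversation)

for length in (1_500, 4_000, 8_000, 40_000):
    print(length, concurrent_sessions(length))
# 1500 -> 24 (batch cap), 4000 -> 14, 8000 -> 7, 40000 -> 1
```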

CPU inference is 4-6x slower than GPU. For a tool backend where the orchestrator is already streaming its response while waiting, that's fine. The user sees the typing indicator. They don't see that three models are collaborating behind it.

What Elastic Actually Does in This Stack

Nemotron-Elastic isn't a general-purpose chat model in our system. It's a reasoning tool. The orchestrator uses it for:

  • Deep analysis: "Analyze this codebase for security vulnerabilities" — orchestrator identifies the task needs depth, calls elastic with the relevant code context
  • Mathematical reasoning: Elastic's Mamba2/Transformer hybrid excels at structured reasoning chains
  • Background neurons: Our neuron system fires parallel context-gathering tasks. Elastic handles the ones that need reasoning, not just retrieval
  • Draft generation: When the orchestrator needs a long-form response drafted before it applies its own tool-calling context

The orchestrator decides when to use elastic the same way you'd decide when to consult an expert: not for every question, just for the ones that benefit from a second opinion.

Under the hood, the orchestrator's reason tool has a four-strategy execution chain that tries backends in priority order:

reason(problem="...")
  → Strategy 1: Cloud GPU (Vast.ai DeepSeek-R1:14b)
  → Strategy 2: Ollama Elastic (CPU, nemotron-elastic:12b)
  → Strategy 3: Cloud API (Gemini/Claude fallback)
  → Strategy 4: Local MicroScheduler (last resort)

The orchestrator doesn't choose the backend. It calls reason, and the execution chain picks the best available one automatically. If the cloud GPU is busy or down, elastic handles it locally. If Ollama isn't running, it falls through to cloud API. The orchestrator's job is deciding when to reason, not where.
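A fallback chain like this is just "try each backend in priority order, first success wins." A minimal sketch — the strategy names mirror the post, but the implementation is assumed, with stub backends so the fall-through is visible:

```python
# Sketch of a priority-ordered fallback chain like the one described
# above; strategy names mirror the post, the code itself is assumed.

def try_cloud_gpu(problem):      raise ConnectionError("instance preempted")
def try_ollama_elastic(problem): return f"elastic: {problem}"
def try_cloud_api(problem):      return f"cloud-api: {problem}"
def try_microscheduler(problem): return f"local: {problem}"

STRATEGIES = [try_cloud_gpu, try_ollama_elastic,
              try_cloud_api, try_microscheduler]

def reason(problem: str) -> str:
    """Walk the chain in priority order; first backend to succeed wins."""
    for strategy in STRATEGIES:
        try:
            return strategy(problem)
        except Exception:
            continue  # backend down or busy: fall through to the next
    raise RuntimeError("no reasoning backend available")

# With the cloud GPU stubbed as "down", the chain lands on elastic:
print(reason("prove it"))  # elastic: prove it
```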

Why Not Just Use Cloud for Everything?

We have two RTX 3090s rented on Vast.ai running DeepSeek-R1. That's our maximum-quality reasoning path. So why bother with local CPU inference?

  1. Latency: Cloud round-trip is 200-400ms before the first token. Elastic on local CPU starts generating in ~500ms, and with no network jitter, no queue, no cold start.
  2. Cost: Vast.ai bills by the hour. For effort 7-8 tasks (moderate reasoning, not maximum quality), burning cloud GPU time is wasteful when local CPU handles it fine.
  3. Availability: Cloud instances can go down, get preempted, or hit quota limits. Local CPU is always there.
  4. Privacy: Some reasoning tasks involve sensitive context. Keeping them on local hardware means they never leave the machine.

The elastic-hybrid stack gives us three tiers of reasoning quality that degrade gracefully: local CPU (fast, free, always available) → cloud GPU (high quality, paid, usually available) → fallback to orchestrator (good enough, instant).

Lessons Learned

Don't fight for VRAM. We spent hours trying to fit elastic on the GPU through increasingly creative quantization schemes. The answer was using the resource that was actually abundant: RAM.

Tool backends don't need to be fast — they need to be reliable. The orchestrator is the fast path. Everything else is a tool call, and tool calls are allowed to take a few seconds.

Make infrastructure decisions reversible. The model-stacks system means we can switch between four completely different inference architectures in one API call. No rebuilding containers, no editing configs, no praying.

The orchestrator never sleeps. This is the one rule we violated three times before making it physically impossible. The orchestrator handles all user-facing interaction. Everything else is a tool it invokes. If you have to choose between sleeping the orchestrator and running inference on CPU, choose CPU.

Mortals breathe oxygen. The orchestrator breathes tokens. And it never stops.
