MicroScheduler: How We Solved GPU Memory Crashes
Here's what happens when you run three LLM agents on one GPU without scheduling: they all try to load models simultaneously, VRAM fills up, and the entire system crashes with an out-of-memory error. Every. Single. Time.
This is the #1 pain point in local AI. And no agent framework solves it, because those frameworks operate at the wrong level of abstraction. You can't solve GPU scheduling inside a Python library — you need an OS-level service that sits between agents and hardware.
How MicroScheduler Works
MicroScheduler is a FastAPI service (port 8088) that acts as the GPU traffic controller. Every LLM request goes through it. Here's the flow (with a request sketch after the list):
- Request arrives — Agent sends an inference request with model name, priority, and estimated tokens
- VRAM check — MicroScheduler checks current VRAM usage against the model's known footprint
- Slot allocation — If a slot is available within VRAM limits, the request proceeds. If not, it's queued.
- Priority queuing — Queued requests are ordered by priority (critical > normal > background)
- Completion tracking — When inference completes, the slot is freed and the next queued request is dispatched
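To make step 1 concrete, here's roughly what a request could look like from the agent's side. The `/schedule` path and the exact JSON keys are assumptions for illustration, not MicroScheduler's documented schema — only the three pieces of information (model name, priority, estimated tokens) come from the flow above.

```python
# Hypothetical client call. The endpoint path and field names are
# illustrative assumptions, not MicroScheduler's documented API.
import requests

resp = requests.post(
    "http://localhost:8088/schedule",   # MicroScheduler listens on port 8088
    json={
        "model": "llama-7b-q4",         # which model to run
        "priority": "normal",           # critical | normal | background
        "estimated_tokens": 512,        # rough size hint for planning
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # either an immediate slot grant or a queued ticket
```

Either way, the agent never loads a model directly; it only ever acts on what the scheduler hands back.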
The VRAM Budget
MicroScheduler maintains a VRAM budget — a live accounting of how much memory each loaded model consumes. When an agent requests a model that would exceed available VRAM, the scheduler has three options (sketched in code after the list):
- Queue — Wait for another model to finish and free VRAM
- Evict — Unload a lower-priority model to make room
- Reject — If the model is too large for the GPU entirely, return an error immediately
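Here's a minimal sketch of the shape of that decision. The class and function names are invented, and the eviction pass is a naive greedy rule — treat it as an illustration of the three outcomes, not MicroScheduler's actual implementation.

```python
# Illustrative admission logic only -- names, numeric priorities, and the
# greedy eviction rule are assumptions, not MicroScheduler's real code.
from dataclasses import dataclass


@dataclass
class LoadedModel:
    name: str
    vram_gb: float
    priority: int  # higher value = more important


def admit(model_vram_gb: float, priority: int,
          total_vram_gb: float, loaded: list[LoadedModel]) -> str:
    """Decide what to do with a load request: run, queue, evict, or reject."""
    if model_vram_gb > total_vram_gb:
        return "reject"                      # can never fit on this GPU
    used = sum(m.vram_gb for m in loaded)
    if used + model_vram_gb <= total_vram_gb:
        return "run"                         # fits alongside the current models
    # Try to find one lower-priority model whose eviction would make room.
    for victim in sorted(loaded, key=lambda m: m.priority):
        if (victim.priority < priority
                and used - victim.vram_gb + model_vram_gb <= total_vram_gb):
            return "evict"                   # unload victim, then load this model
    return "queue"                           # wait for VRAM to free up


# Example: a 24 GB GPU already holding one 13.5 GB model (a 13B model at 8-bit).
existing = [LoadedModel("llama-13b-q8", 13.5, priority=1)]
print(admit(13.5, priority=2, total_vram_gb=24.0, loaded=existing))  # -> evict
```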
This accounting prevents the catastrophic failure mode where two 13B models try to load on a 24GB GPU simultaneously: at 8-bit quantization each needs roughly 13–14GB for weights alone, so together they blow past 24GB before a single token is generated. The scheduler knows the math before the GPU does.
Concurrency Slots
Each GPU profile gets a configured number of concurrent inference slots. An RTX 4090 with 24GB VRAM typically gets 2–3 slots for 7B quantized models, or 1 slot for a 33B model. The slot count is auto-configured based on detected hardware, but can be tuned manually.
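As a rough picture of what such a profile might look like — the structure, the key names, and every number except the RTX 4090 guidance above are assumptions, not a real MicroScheduler config:

```python
# Hypothetical slot profiles keyed by detected GPU. The 4090 numbers follow
# the guidance above; everything else is an assumption for illustration.
SLOT_PROFILES = {
    "RTX 4090 (24GB)": {"7b-quantized": 3, "13b-quantized": 2, "33b": 1},
    "RTX 3060 (12GB)": {"7b-quantized": 1},
}


def slots_for(gpu_profile: str, model_class: str, default: int = 1) -> int:
    """Return the concurrent-slot count for a GPU profile and model size class."""
    return SLOT_PROFILES.get(gpu_profile, {}).get(model_class, default)


print(slots_for("RTX 4090 (24GB)", "7b-quantized"))  # -> 3
```

Manual tuning then amounts to overriding one of these numbers for your specific workload.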
The result: zero OOM crashes. Agents wait their turn. The GPU runs at high utilization without ever exceeding its VRAM budget.