MicroScheduler: How We Solved GPU Memory Crashes
Here's what happens when you run three LLM agents on one GPU without scheduling: they all try to load models simultaneously, VRAM fills up, and the entire system crashes with an out-of-memory error. Every. Single. Time.
This is the #1 pain point in local AI. And no agent framework solves it, because those frameworks operate at the wrong level of abstraction. You can't solve GPU scheduling inside a Python library — you need an OS-level service that sits between agents and hardware.
How MicroScheduler Works
MicroScheduler is a FastAPI service (port 8088) that acts as the GPU traffic controller. Every LLM request goes through it. Here's the flow (with a request sketch after the list):
- Request arrives — Agent sends an inference request with model name, priority, and estimated tokens
- VRAM check — MicroScheduler checks current VRAM usage against the model's known footprint
- Slot allocation — If a slot is available within VRAM limits, the request proceeds. If not, it's queued.
- Priority queuing — Queued requests are ordered by priority (critical > normal > background)
- Completion tracking — When inference completes, the slot is freed and the next queued request is dispatched
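To make step 1 concrete, here's roughly what a request could look like from the agent's side. The `/schedule` path and the exact JSON keys are assumptions for illustration, not MicroScheduler's documented schema — only the three pieces of information (model name, priority, estimated tokens) come from the flow above.

```python
# Hypothetical client call. The endpoint path and field names are
# illustrative assumptions, not MicroScheduler's documented API.
import requests

resp = requests.post(
    "http://localhost:8088/schedule",   # MicroScheduler listens on port 8088
    json={
        "model": "llama-7b-q4",         # which model to run
        "priority": "normal",           # critical | normal | background
        "estimated_tokens": 512,        # rough size hint for planning
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # either an immediate slot grant or a queued ticket
```

Either way, the agent never loads a model directly; it only ever acts on what the scheduler hands back.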
The VRAM Budget
MicroScheduler maintains a VRAM budget — a live accounting of how much memory each loaded model consumes. When an agent requests a model that would exceed available VRAM, the scheduler has three options (sketched in code after the list):
- Queue — Wait for another model to finish and free VRAM
- Evict — Unload a lower-priority model to make room
- Reject — If the model is too large for the GPU entirely, return an error immediately
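Here's a minimal sketch of the shape of that decision. The class and function names are invented, and the eviction pass is a naive greedy rule — treat it as an illustration of the three outcomes, not MicroScheduler's actual implementation.

```python
# Illustrative admission logic only -- names, numeric priorities, and the
# greedy eviction rule are assumptions, not MicroScheduler's real code.
from dataclasses import dataclass


@dataclass
class LoadedModel:
    name: str
    vram_gb: float
    priority: int  # higher value = more important


def admit(model_vram_gb: float, priority: int,
          total_vram_gb: float, loaded: list[LoadedModel]) -> str:
    """Decide what to do with a load request: run, queue, evict, or reject."""
    if model_vram_gb > total_vram_gb:
        return "reject"                      # can never fit on this GPU
    used = sum(m.vram_gb for m in loaded)
    if used + model_vram_gb <= total_vram_gb:
        return "run"                         # fits alongside the current models
    # Try to find one lower-priority model whose eviction would make room.
    for victim in sorted(loaded, key=lambda m: m.priority):
        if (victim.priority < priority
                and used - victim.vram_gb + model_vram_gb <= total_vram_gb):
            return "evict"                   # unload victim, then load this model
    return "queue"                           # wait for VRAM to free up


# Example: a 24 GB GPU already holding one 13.5 GB model (a 13B model at 8-bit).
existing = [LoadedModel("llama-13b-q8", 13.5, priority=1)]
print(admit(13.5, priority=2, total_vram_gb=24.0, loaded=existing))  # -> evict
```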
This accounting prevents the catastrophic failure mode where two 13B models try to load on a 24GB GPU simultaneously: at 8-bit quantization each needs roughly 13–14GB for weights alone, so together they blow past 24GB before a single token is generated. The scheduler knows the math before the GPU does.
Concurrency Slots
Each GPU profile gets a configured number of concurrent inference slots. An RTX 4090 with 24GB VRAM typically gets 2–3 slots for 7B quantized models, or 1 slot for a 33B model. The slot count is auto-configured based on detected hardware, but can be tuned manually.
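As a rough picture of what such a profile might look like — the structure, the key names, and every number except the RTX 4090 guidance above are assumptions, not a real MicroScheduler config:

```python
# Hypothetical slot profiles keyed by detected GPU. The 4090 numbers follow
# the guidance above; everything else is an assumption for illustration.
SLOT_PROFILES = {
    "RTX 4090 (24GB)": {"7b-quantized": 3, "13b-quantized": 2, "33b": 1},
    "RTX 3060 (12GB)": {"7b-quantized": 1},
}


def slots_for(gpu_profile: str, model_class: str, default: int = 1) -> int:
    """Return the concurrent-slot count for a GPU profile and model size class."""
    return SLOT_PROFILES.get(gpu_profile, {}).get(model_class, default)


print(slots_for("RTX 4090 (24GB)", "7b-quantized"))  # -> 3
```

Manual tuning then amounts to overriding one of these numbers for your specific workload.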
The result: zero OOM crashes. Agents wait their turn. The GPU runs at high utilization without ever exceeding its VRAM budget.