engineering · performance · asyncio · debugging

Event Loop Starvation: When Your OS Nervous System Can’t Breathe

March 4, 2026 · 9 min read · David Parkhurst

AitherPulse is the heartbeat of AitherOS. Every second it ticks — checking container health, publishing metrics to Flux, triggering reflexes, updating dashboards. It's the nervous system of an operating system that runs 48 Docker containers across 10 architectural layers. When Pulse is healthy, every service knows about every other service within one second.

One morning, Pulse wasn't healthy. Health checks that should have returned in 15 milliseconds were taking 9 full seconds. The event loop was starving.

The Symptom

A simple curl http://localhost:8081/health should return immediately. Instead, it hung for 9,243 milliseconds. FastAPI was running, the endpoint was registered, the route handler was a one-liner returning a JSON dict. Nothing about the handler itself was slow.

The problem wasn't the handler. The problem was that the handler couldn't get scheduled. Python's asyncio event loop is cooperative — coroutines must yield control voluntarily for other coroutines to run. If something hogs the loop, every pending task waits. Including health checks. Including the /health endpoint that Docker uses to decide if your container is alive.

This is event loop starvation, and we had six independent causes all compounding at once.
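Before digging into the causes, a minimal, self-contained sketch makes the failure mode concrete (the names here are illustrative, not from the Pulse codebase). A trivially fast coroutine is scheduled first, yet it cannot run until a blocking call elsewhere finishes:

```python
import asyncio
import time

async def health_check():
    # Stand-in for the one-liner /health handler: instant once it gets scheduled.
    return {"status": "ok"}

async def blocking_tick():
    # time.sleep() never yields, so no other coroutine runs for 2 full seconds.
    time.sleep(2)

async def main():
    start = time.monotonic()
    check = asyncio.create_task(health_check())  # scheduled, but not yet run
    await blocking_tick()                        # hogs the loop
    await check                                  # only now does health_check run
    return time.monotonic() - start

elapsed = asyncio.run(main())
print(f"health check completed after {elapsed:.1f}s")  # prints ~2.0s
```

The handler itself costs microseconds; the 2 seconds are pure scheduling delay, which is exactly the shape of the 9-second hang above.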

Root Cause 1: Synchronous Subprocess in an Async World

The GPU monitoring tick ran every 10 seconds to check VRAM usage via nvidia-smi. The implementation used subprocess.run() — a blocking call that freezes the entire event loop until the child process completes. On a busy system, nvidia-smi can take 200–800ms. During that window, zero other coroutines can execute. No health checks, no metrics, no HTTP responses. Nothing.

# BEFORE — blocks the entire event loop
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,..."],
    capture_output=True, timeout=5
)

# AFTER — yields control while subprocess runs
proc = await asyncio.create_subprocess_exec(
    "nvidia-smi", "--query-gpu=memory.used,...",
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.PIPE
)
stdout, _ = await asyncio.wait_for(
    proc.communicate(), timeout=5.0
)

The await is everything. While nvidia-smi runs, the event loop is free to process health checks, handle HTTP requests, and publish metrics. One keyword, 800ms of blocked time eliminated.

Root Cause 2: Unbounded Fire-and-Forget to AitherSense

AitherSense is the perception layer — it ingests metrics, detects anomalies, and feeds the homeostatic feedback loop. Pulse calls Sense on every tick to report service states. The calls used asyncio.create_task() in fire-and-forget mode: launch the HTTP request, don't wait for it, move on.

Sounds efficient. It's catastrophic. If Sense is slow or down, those tasks pile up. Each one holds an HTTP connection, a socket, and a slot in the event loop's task queue. Fifty ticks later, you have 50 zombie tasks all waiting on a dead endpoint, and the event loop is spending more time context-switching between them than doing real work.

# Circuit breaker: max 3 concurrent Sense calls
SENSE_SEMAPHORE = asyncio.Semaphore(3)
SENSE_FAIL_COUNT = 0
SENSE_CIRCUIT_OPEN = False

async def _safe_sense_call(data: dict):
    global SENSE_FAIL_COUNT, SENSE_CIRCUIT_OPEN
    if SENSE_CIRCUIT_OPEN:
        return  # Don't even try
    if SENSE_SEMAPHORE.locked():
        return  # All 3 slots busy; drop this report rather than queue it
    async with SENSE_SEMAPHORE:
        try:
            await http_client.post(sense_url, json=data, timeout=2.0)
            SENSE_FAIL_COUNT = 0
        except Exception:
            SENSE_FAIL_COUNT += 1
            if SENSE_FAIL_COUNT >= 5:
                SENSE_CIRCUIT_OPEN = True

The semaphore caps concurrency at 3. The circuit breaker stops all attempts after 5 consecutive failures. No more zombie task storms.

Root Cause 3: Flux Publish Flooding

Every Pulse tick publishes events to Flux (the event bus) — service health updates, metric snapshots, reflex triggers. Each publish is an HTTP POST. When you have 48 containers reporting health, that's 48 publishes per tick. At 1Hz, that's 48 concurrent HTTP requests per second, every second, forever.

We added a publish semaphore limiting concurrent Flux publishes to 5:

PUBLISH_SEMAPHORE = asyncio.Semaphore(5)

async def _throttled_publish(event_type: str, data: dict):
    async with PUBLISH_SEMAPHORE:
        await flux_emitter.publish(event_type, data)

Publishes now queue naturally instead of stampeding. The event loop processes them 5 at a time, with breathing room between batches for health checks and other work.

Root Cause 4: Reflex Cascade on SERVICE_HEALTH Events

AitherPulse has a reflex system — reactive behaviors triggered by events. When it publishes a SERVICE_HEALTH event, that event hits Flux, which broadcasts it back to all subscribers, including Pulse itself. Pulse's reflex engine picks it up and runs reflex handlers, which may publish more events, which trigger more reflexes.

A feedback loop. Not infinite — the events have different types downstream — but deep enough to cascade 3–4 levels on every single tick. Each level spawns tasks, makes HTTP calls, and consumes event loop time. Multiply by 48 containers and you get hundreds of unnecessary reflex evaluations per second.

# Skip reflexes for routine health events
if event_type == "SERVICE_HEALTH":
    return  # These are OUR events — don't react to them

One if statement. The cascade dies at the source.

Root Cause 5: Aggressive Flux Polling

The Flux event listener polled for new events every 100 milliseconds. That's 10 HTTP requests per second just to check for events, even when nothing is happening. Each poll is a full HTTP round-trip: connect, send, wait, parse, close. On an event loop already under pressure, 10Hz polling is the difference between starvation and stability.

# BEFORE: 10Hz polling — 10 requests/sec
await asyncio.sleep(0.1)

# AFTER: 2Hz polling — 5x less pressure
await asyncio.sleep(0.5)

Half-second polling is still fast enough to react to events within one tick cycle. The event loop gets 400ms of breathing room per poll cycle instead of 0ms.

Root Cause 6: Missing Yield Points in Hot Loops

Pulse iterates over all 48 containers, checks each one, processes the result, updates internal state, and prepares the next publish — all in a tight loop with no await statements between iterations. In asyncio, no await means no yield. No yield means no other task runs until the loop completes.

# Add yield points so other coroutines can breathe
for container in containers:
    status = await check_container(container)
    update_state(container, status)
    await asyncio.sleep(0)  # Yield to event loop

asyncio.sleep(0) is the minimal yield. It doesn't actually sleep — it just puts the current coroutine at the back of the ready queue and lets everything else run. Adding it after each container check means health endpoints can respond between iterations instead of waiting for all 48 to finish.
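The interleaving is observable in a few lines. In this sketch (names are illustrative), a second task gets to run between iterations of a loop purely because of the sleep(0) yield point:

```python
import asyncio

order = []

async def worker():
    # Stand-in for the 48-container loop; 3 iterations keep the output short.
    for i in range(3):
        order.append(f"worker-{i}")
        await asyncio.sleep(0)  # yield: lets other ready tasks run

async def health():
    # Stand-in for a pending /health request.
    order.append("health")

async def main():
    await asyncio.gather(worker(), health())

asyncio.run(main())
print(order)  # ['worker-0', 'health', 'worker-1', 'worker-2']
```

Remove the sleep(0) and the output becomes ['worker-0', 'worker-1', 'worker-2', 'health']: the health task waits for the entire loop, which is the 48-container version of the original bug.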

The Result

| Metric | Before | After |
| --- | --- | --- |
| Health endpoint response | 9,243ms | 15–19ms |
| Concurrent Sense tasks (peak) | 50+ | 3 (capped) |
| Flux publishes per tick | 48 simultaneous | 5 concurrent, queued |
| Reflex evaluations per tick | 200+ | ~48 (no cascade) |
| Flux polls per second | 10 | 2 |
| Event loop yield points | 0 per iteration | 1 per container |
| GPU check blocking time | 200–800ms | 0ms (async) |

Six independent fixes, each targeting a different starvation vector. None of them alone would have solved the problem — the loop was being hit from every direction. But together, they took Pulse from clinically dead to responding in under 20 milliseconds. Consistently. Under load. With all 48 containers running.

Lessons for Async Python

Event loop starvation is death by a thousand cuts. No single line of code was the villain. Each cause was reasonable in isolation: use subprocess.run because it's simpler; fire-and-forget because it's faster; poll frequently because events matter; skip yields because the loop is “short.” Together, they compound into a system that can't serve a health check.

Here's what we now enforce across all AitherOS services:

  1. Never use blocking calls in async code. No subprocess.run(), no requests.get(), no time.sleep(). If it doesn't have await, it doesn't belong in an async function.
  2. Bound all fire-and-forget patterns. Semaphores on concurrent tasks. Circuit breakers on flaky endpoints. If you create a task, you own its lifecycle.
  3. Throttle event publishing. Publishing 48 events simultaneously is a flood, not a feature. Queue them.
  4. Break feedback loops at the source. If you publish an event and subscribe to it, you've built a loop. Detect it. Short-circuit it.
  5. Yield in loops. Any loop that iterates more than ~10 times needs an await asyncio.sleep(0). No exceptions.
  6. Poll lazily. 10Hz is almost never necessary. 2Hz is almost always enough. Your event loop will thank you.
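Rules like these erode unless something enforces them. One way to catch regressions automatically (our suggestion, not something from the Pulse codebase) is asyncio's built-in debug mode: with debug enabled, the loop times every task step and logs a warning naming any callback that holds the loop longer than slow_callback_duration.

```python
import asyncio
import time

async def bad_tick():
    time.sleep(0.3)  # a blocking call, exactly what rule 1 forbids

async def main():
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.1  # warn when anything holds the loop >100ms
    await bad_tick()

# debug=True enables the slow-callback warnings; asyncio logs something like
# "Executing <Task ... coro=<main() ...>> took 0.300 seconds" to the 'asyncio' logger.
asyncio.run(main(), debug=True)
```

Running services with this enabled in staging would have named every one of the six offenders above long before the 9-second hang appeared in production.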

The Bigger Picture

AitherOS runs on the idea that an operating system should be alive — sensing, reacting, adapting. Pulse is the mechanism that makes that possible. When Pulse starves, the entire OS goes blind. Services don't know each other's health, reflexes don't fire, dashboards freeze, and Docker starts killing containers that are actually fine but can't prove it fast enough.

Fixing event loop starvation wasn't just a performance optimization. It was keeping the nervous system alive. And like real nervous systems, the fix wasn't one thing — it was systemic. Every pathway that could block had to be made async. Every amplification loop had to be dampened. Every flood had to be throttled.

The heartbeat is strong now. Fifteen milliseconds, every time.