The New Bitcoin Mining: How We Turned Idle GPU Cycles into Intelligence
April 13, 2026 · AitherOS Engineering
There is an uncomfortable truth about running LLMs on your own hardware: the GPU is idle most of the time. Not "mostly idle." Dramatically, embarrassingly idle. We measured our RTX 5090 over a 7-day period and the utilization chart looked like a heart monitor — brief spikes when someone sends a message, then long flatlines of nothing. Average utilization: 11%. Peak: 94%. Median: 3%.
That's 32GB of the fastest memory on Earth and 21,760 CUDA cores doing literally nothing for 89% of the day.
We looked at this and saw a Bitcoin mine. Not for cryptocurrency — for intelligence. Every idle second is a training opportunity. Every conversation that already happened is training data. The model served 400 requests today. It could have learned from all of them, and it didn't, because the GPU was too busy doing nothing.
So we built a system that watches GPU utilization in real time, detects when the GPU goes idle, and fires micro-training batches that improve the model from its own traffic. When a request comes in, training pauses in under one poll cycle. No dropped requests. No added latency. The model just gets better, continuously, for free.
This is the full architecture.
The State Machine
At the heart of the system is IdleGPUTrainingScheduler — a finite state machine with 5 states that models the GPU's current activity:
UNKNOWN → WARMING → IDLE → TRAINING → BUSY
| State | Meaning |
|---|---|
| UNKNOWN | Just started, no data yet |
| WARMING | GPU util dropped below 15% — timer started |
| IDLE | Stayed below 15% for 5 minutes — eligible to train |
| TRAINING | Actively running a micro-training batch |
| BUSY | GPU util ≥ 40% — training paused, inference has priority |
The transitions are designed around one principle: inference always wins. If GPU utilization spikes above 40% at any point — even mid-training — the scheduler immediately clears the pause event and the trainer yields the GPU. There is no "let me finish this batch" grace period. The pause is immediate.
# Busy detection — immediate, overrides everything except active training
if gpu_util >= busy_threshold:
    if old_state == GPUState.TRAINING:
        # Signal the training task to pause
        self._pause_event.clear()
        logger.info(f"[IDLE-TRAIN] GPU busy ({gpu_util:.0f}%), pausing active training")
    self.state.gpu_state = GPUState.BUSY
    self.state.idle_since = None
    self.state.consecutive_batches = 0
    return GPUState.BUSY
The state machine polls every 15 seconds via AitherWatch (our GPU monitoring plugin) or falls back to local pynvml if the monitoring service is down. The 15-second poll means worst-case latency impact is one poll cycle — 15 seconds of shared GPU between training and an incoming request. In practice, training batches are small enough (50-100 steps) that they barely register.
The Cautious Zone
There's an important design subtlety between 15% and 40% utilization. This is the "cautious zone" — above idle threshold but below busy threshold. The system treats it conservatively:
- If you're WARMING and util rises into the cautious zone → reset to BUSY. The idle timer restarts.
- If you're IDLE and a brief spike enters the cautious zone → stay IDLE. Brief spikes shouldn't abort a training opportunity.
- If you're TRAINING and util enters the cautious zone → continue training. Only ≥40% triggers a pause.
This asymmetry is intentional. Getting into the IDLE state is hard (5 minutes of sustained low utilization). Leaving it is easy (any spike above 40%). This means training only starts when we're confident the GPU is genuinely idle, and stops the moment there's real work to do.
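The asymmetric rules above can be condensed into a single transition step. This is a simplified sketch, not the actual `IdleGPUTrainingScheduler` code: the function name, the string states, and the `idle_for_s` parameter are illustrative, and the real implementation tracks timers and VRAM as well.

```python
# Thresholds from the article's defaults
IDLE_PCT, BUSY_PCT, IDLE_WINDOW_S = 15, 40, 300

def next_state(state: str, util: float, idle_for_s: float) -> str:
    """One cautious-zone transition step (simplified, hypothetical sketch)."""
    if util >= BUSY_PCT:
        return "BUSY"                      # real work: training yields immediately
    if state in ("IDLE", "TRAINING"):
        return state                       # brief spikes below 40% are tolerated
    if util < IDLE_PCT:
        if state == "WARMING" and idle_for_s >= IDLE_WINDOW_S:
            return "IDLE"                  # sustained quiet: eligible to train
        return "WARMING"                   # start (or continue) the idle timer
    return "BUSY"                          # cautious zone while warming: timer restarts
```

Note how the cautious zone (15-40%) is only reachable by the last line: an IDLE or TRAINING state has already returned before it, which is exactly the "hard to enter, easy to leave" asymmetry.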
The Configuration
Every threshold is tunable. Here are the defaults:
DEFAULT_CONFIG = {
    # Detection thresholds
    "poll_interval_seconds": 15,
    "idle_threshold_percent": 15,
    "idle_window_seconds": 300,           # 5 minutes of sustained idle
    "busy_threshold_percent": 40,
    "min_free_vram_mb": 4096,             # Need 4GB free
    # Training budget
    "max_daily_steps": 2000,
    "max_daily_batches": 40,
    "cooldown_after_train_seconds": 120,  # 2 min between batches
    "max_consecutive_batches": 3,
    # Night mode
    "night_mode_enabled": True,
    "night_hours_start": 1,               # 1 AM
    "night_hours_end": 6,                 # 6 AM
    "night_idle_window_seconds": 60,      # Only 1 min idle needed
    "night_max_consecutive_batches": 10,
    # Jitter
    "jitter_seconds": 30,
}
The 4GB VRAM check is critical. Even when the GPU is idle (not running inference), vLLM's KV cache and model weights occupy most of the card's VRAM. Training needs working memory for gradients, optimizer states, and activations. With LoRA rank 8 on an 8B model, the actual training overhead is about 1.5GB — well within the 4GB budget. But we check anyway, because running out of VRAM mid-training would crash both training and the serving process.
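The eligibility gate is two independent conditions, and either one failing blocks training. A minimal sketch (the function name is hypothetical; the real scheduler also checks daily budgets and cooldowns):

```python
def can_train(gpu_util: float, free_vram_mb: int,
              idle_pct: float = 15.0, min_free_mb: int = 4096) -> bool:
    """Gate a training batch on utilization AND VRAM headroom.

    A quiet GPU with a full KV cache is not trainable; neither is a
    busy GPU with plenty of free memory.
    """
    return gpu_util < idle_pct and free_vram_mb >= min_free_mb
```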
Night Mode: The Aggressive Shift
Between 1 AM and 6 AM, the system gets significantly more aggressive:
- Idle window drops from 5 minutes to 1 minute. At 2 AM, if the GPU has been idle for 60 seconds, that's enough confidence to start training. Nobody's chatting at 2 AM.
- Consecutive batch limit rises from 3 to 10. During the day, we train at most 3 batches per idle window to minimize risk of interfering with a sudden burst of requests. At night, we'll train 10 batches straight.
- All other thresholds stay the same. The 15% idle threshold and 40% busy threshold don't change. If something actually needs the GPU at night (a scheduled backup, a cron job, a batch inference), training still yields immediately.
This means a 5-hour night window with a quiet GPU can produce 50 training batches × 100 steps = 5,000 gradient steps. That's a meaningful amount of adaptation — roughly equivalent to fine-tuning on 2,500 examples.
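The night-mode switch only overrides two knobs. A sketch of how the effective limits might be selected, assuming the defaults above and treating the end hour as exclusive (the actual boundary semantics are an assumption):

```python
DEFAULTS = {
    "idle_window_seconds": 300, "max_consecutive_batches": 3,
    "night_mode_enabled": True, "night_hours_start": 1, "night_hours_end": 6,
    "night_idle_window_seconds": 60, "night_max_consecutive_batches": 10,
}

def effective_limits(hour: int, cfg: dict = DEFAULTS) -> tuple[int, int]:
    """Return (idle_window_seconds, max_consecutive_batches) for the given hour."""
    night = (cfg["night_mode_enabled"]
             and cfg["night_hours_start"] <= hour < cfg["night_hours_end"])
    if night:
        return cfg["night_idle_window_seconds"], cfg["night_max_consecutive_batches"]
    return cfg["idle_window_seconds"], cfg["max_consecutive_batches"]
```

At 2 AM this yields a 60-second window and 10 consecutive batches; at 2 PM, 300 seconds and 3 batches.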
The Three-Way Rotation
When the scheduler decides it's time to train, it doesn't always train the same thing. There are three training pipelines, and they rotate based on the batch number:
batch_mod = self.state.consecutive_batches % 3
# Batch 0 → Specialist pretraining (42M param model from scratch)
# Batch 1 → Per-agent LoRA training (agent skill specialization)
# Batch 2 → General LoRA micro-training (Nemotron-8B)
Each slot falls through if its pipeline has no work. If the specialist model is fully trained, batch 0 falls through to agent training, then to general LoRA. No idle time is wasted.
Pipeline 1: Specialist Pretraining
We're training a ~42 million parameter transformer from scratch. Not a fine-tune — a fresh model built on the AitherOS operational corpus. It provides:
- Fast intent routing (<1ms vs 100ms+ from the 8B model)
- Semantic anomaly detection (flag unusual patterns without an LLM call)
- Domain-native embeddings (trained on your data, not Wikipedia)
- Offline fallback generation (basic responses when the primary model is down)
Each idle trigger runs 10 training steps. Over days of accumulated idle time, the specialist gradually converges. The system tracks progress as a completion percentage and emits events for the dashboard.
Pipeline 2: Per-Agent LoRA Training
AitherOS runs 48+ agent identities, each with different specializations. The per-agent training pipeline picks the highest-priority agent that's due for training (based on a cooldown timer and a priority queue from AgentModelManager) and runs a focused LoRA training session on that agent's accumulated conversation data.
The result: each agent develops its own micro-specialization over time. The security agent gets better at security analysis. The code review agent gets better at finding bugs. The writing agent develops a more consistent voice. All from idle GPU cycles.
Pipeline 3: General LoRA Micro-Training
This is the main adaptation pipeline. The MicroTrainingDaemon runs one "tick":
- Harvest examples from 4 sources: session learnings (failure patterns), successful conversations, daydream/personality data, and knowledge graph patterns
- Filter by quality floor (0.6 minimum)
- Deduplicate by instruction hash
- Distill through a reasoning model (ExperienceDistiller synthesizes proper teaching material from raw logs)
- Train a LoRA adapter: rank 8, alpha 16, learning rate 1e-6, max 100 steps
- Evaluate on 5 held-out samples
- Rollback if loss regresses by more than 0.1 (EMA tracking)
- Hot-reload the new adapter into the running vLLM server — zero downtime
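The quality-floor and dedup steps above are simple to sketch. This is an illustrative version (the real daemon's example schema and hash choice are assumptions):

```python
import hashlib

def prepare_batch(examples: list[dict], quality_floor: float = 0.6) -> list[dict]:
    """Apply the quality floor, then dedupe by a hash of the instruction text."""
    seen, batch = set(), []
    for ex in examples:
        if ex.get("quality", 0.0) < quality_floor:
            continue  # below the 0.6 floor: discard
        key = hashlib.sha256(ex["instruction"].encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            batch.append(ex)
    return batch
```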
The hot-reload is the magic. After training, the system POSTs to vLLM's LoRA swap endpoint with the new adapter path. The running inference server picks up the new weights without restarting, without dropping any in-flight requests, without any user-visible interruption. The model just got 0.1% smarter, and nobody noticed.
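For reference, vLLM exposes a dynamic-adapter endpoint (`/v1/load_lora_adapter`) when the server runs with runtime LoRA updating enabled (`VLLM_ALLOW_RUNTIME_LORA_UPDATING=True`); exact behavior varies by vLLM version. A sketch of building that request with the standard library — the adapter name and path here are made up:

```python
import json
import urllib.request

def build_swap_request(base_url: str, name: str, path: str) -> urllib.request.Request:
    """Build the POST that registers a new LoRA adapter with a running vLLM server."""
    body = json.dumps({"lora_name": name, "lora_path": path}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/load_lora_adapter",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical usage; send with urllib.request.urlopen(req)
req = build_swap_request("http://localhost:8000", "general-v413", "/adapters/v413")
```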
The Micro-Training Engine
The MicroTrainingDaemon is deliberately conservative. This is not hyperparameter-optimized large-scale training. This is continuous, tiny, risk-averse adaptation:
MICRO_TRAINER_DEFAULTS = {
    "micro_batch_size": 50,            # 50 examples per batch
    "min_examples_to_train": 10,       # Don't bother with less than 10
    "learning_rate": 1e-6,             # Very conservative
    "max_steps_per_batch": 100,        # Quick passes
    "lora_rank": 8,                    # Small adapter delta
    "lora_alpha": 16,                  # 2× rank
    "quality_floor": 0.6,              # Minimum quality score
    "rollback_on_regression": True,    # Revert if eval drops
    "max_vram_mb": 4000,               # 4GB VRAM cap
    "max_adapter_versions": 10,        # Keep last 10 checkpoints
    "eval_samples": 5,                 # Quick eval post-training
    "regression_threshold": 0.1,       # Max loss increase before rollback
    "distill_enabled": True,           # LLM-distill raw data first
}
Learning rate 1e-6. That's roughly two orders of magnitude below a typical LoRA fine-tuning rate (1e-4 is a common default). The model nudges. It doesn't jump. A single bad batch can't cause catastrophic forgetting because the step size is too small to move far from the original weights.
The LoRA rank of 8 means well under 0.1% of the model's parameters are trainable. The vast majority of the model is frozen. Training modifies a tiny additive overlay — like pencil notes in the margins of a textbook. If the notes are bad, you erase them (rollback) and the textbook is unchanged.
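The arithmetic behind that fraction, under assumed Llama-style shapes (hidden size 4096, 32 layers) and assuming LoRA targets two square projections per layer — the actual target modules in the daemon aren't stated:

```python
def lora_trainable_params(hidden: int, rank: int, n_layers: int,
                          n_proj: int = 2) -> int:
    """Adapter params when LoRA targets n_proj (hidden x hidden) projections per
    layer: each projection gets A (hidden x rank) plus B (rank x hidden)."""
    return n_layers * n_proj * 2 * hidden * rank

# Assumed 8B-model shapes: hidden 4096, 32 layers, two projections targeted
trainable = lora_trainable_params(4096, 8, 32)   # ~4.2M parameters
fraction = trainable / 8e9                       # ~0.05% of the base model
```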
The Distillation Step
Raw conversation logs make terrible training data. "User said X, model said Y" captures the surface but not the reasoning. The ExperienceDistiller pipes raw batches through a reasoning model that synthesizes teaching material:
- Extracts the principle behind a successful response, not just the response
- Identifies the failure mode in unsuccessful responses, not just the error
- Converts multi-turn dialogues into single-turn instruction-response pairs
- Filters out conversations that were too simple or too ambiguous to learn from
This quality amplification step is what makes continuous training viable. Without it, you'd be training on noise. With it, every batch carries signal.
Loss Tracking and Rollback
Loss is tracked as an exponential moving average with alpha 0.3:
self._loss_ema = alpha * new_loss + (1 - alpha) * self._loss_ema
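A self-contained sketch of the tracker, combining the EMA update above with the regression check described below (the class name is illustrative):

```python
class LossEMA:
    """EMA loss tracker with the article's alpha (0.3) and regression test."""

    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha
        self.value = None

    def update(self, loss: float) -> float:
        # First observation seeds the average; later ones blend in with weight alpha
        self.value = loss if self.value is None else (
            self.alpha * loss + (1 - self.alpha) * self.value)
        return self.value

    def regressed(self, loss: float, threshold: float = 0.1) -> bool:
        """True when a batch's loss exceeds the EMA by more than the threshold."""
        return self.value is not None and loss > self.value + threshold
```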
If a new batch's loss exceeds the EMA by more than the regression threshold (0.1), the adapter is deleted and the version rolls back:
if run.final_loss and self._loss_ema is not None:
    if run.final_loss > self._loss_ema + self.config["regression_threshold"]:
        # Rollback: delete this adapter, revert to previous version
        shutil.rmtree(adapter_dir)
        self._current_version -= 1
In practice, rollbacks are rare (~2% of batches). The conservative learning rate and quality filtering handle most cases. But when they do fire, they provide a hard safety net: the model cannot persistently degrade from continuous training.
Warm Start
On startup, the daemon loads up to 100 previously distilled examples from disk (last 3 days of distilled JSONL files). This means it doesn't start cold — the first idle-triggered training tick has a full buffer and can fire immediately without waiting to harvest new data.
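A sketch of that warm start, assuming one JSON example per line in the distilled files; the function name and directory layout are illustrative:

```python
import json
import time
from pathlib import Path

def warm_start(distilled_dir: Path, max_examples: int = 100,
               max_age_days: int = 3) -> list[dict]:
    """Reload recent distilled examples so the first idle tick can fire immediately."""
    cutoff = time.time() - max_age_days * 86400
    buffer: list[dict] = []
    # Newest files first, so the freshest examples fill the buffer
    files = sorted(distilled_dir.glob("*.jsonl"),
                   key=lambda p: p.stat().st_mtime, reverse=True)
    for f in files:
        if f.stat().st_mtime < cutoff:
            continue  # older than the 3-day window
        for line in f.read_text().splitlines():
            if line.strip():
                buffer.append(json.loads(line))
            if len(buffer) >= max_examples:
                return buffer
    return buffer
```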
The Jitter Problem
In multi-GPU or multi-instance setups, multiple idle trainers might detect idle simultaneously and compete for resources. We add 0-30 seconds of random jitter before starting training:
jitter = random.uniform(0, self.config.get("jitter_seconds", 30))
await asyncio.sleep(jitter)

# Post-jitter GPU re-check — crucial
gpu_util, vram_free = await self._poll_gpu()
if gpu_util >= self.config["busy_threshold_percent"]:
    return None  # GPU went busy during jitter, abort
The post-jitter re-check is the important detail. The GPU might have gone busy during the 30-second jitter delay. Without the re-check, you'd start training on a busy GPU. With it, you catch the case and abort cleanly.
The Parallel Idle Task System
Training isn't the only thing that benefits from idle GPU time. The SchedulerLoop registers 5 additional idle tasks through the SlotPool dispatcher:
| Task | Category | Cooldown | Purpose |
|---|---|---|---|
| Memory consolidation | GPU | 10 min | Compact and merge short-term memories |
| CodeGraph reindex | CPU | 30 min | Incremental code analysis reindex |
| Knowledge graph maintenance | GPU | 1 hour | Prune stale nodes, compact embeddings |
| Embedding warmup | GPU | 15 min | Pre-cache embeddings for common queries |
| Training data harvest | GPU | 2 hours | Collect and prepare new training examples |
These tasks use a separate idle detection mechanism (SlotPool's WorkCategory system with a 25% utilization threshold) and run alongside the training scheduler without interfering. They're cooperatively cancellable — each handler checks a cancel_event periodically and exits cleanly when preempted.
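Cooperative cancellation is just a flag checked between small units of work. A minimal sketch of one such handler; the task name and chunking are illustrative, not the actual SlotPool API:

```python
import asyncio

async def consolidation_task(cancel_event: asyncio.Event, chunks: int = 100) -> int:
    """Idle-task sketch: work in small chunks, checking the cancel flag between them."""
    done = 0
    for _ in range(chunks):
        if cancel_event.is_set():
            break                  # preempted: stop cleanly at a chunk boundary
        await asyncio.sleep(0)     # stand-in for one unit of real work
        done += 1
    return done
```

Because the check happens between chunks rather than via hard task cancellation, the handler always exits at a consistent point instead of mid-write.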
The Dashboard
Everything is observable. AitherEvolutionCore exposes REST endpoints for every aspect of the training system:
- `/loops/status` — All 9 background loop statuses, run counts, health
- `/idle-training/status` — Current GPU state, daily budget remaining, total GPU hours reclaimed
- `/idle-training/trigger` — Manual trigger that bypasses idle detection (for testing)
- `/microtraining/status` — Adapter version, loss history, buffer state
- `/microtraining/trigger` — Manual micro-training tick
The "GPU hours reclaimed" metric is our favorite. It tracks total wall-clock seconds that were spent training instead of idling. After a month of operation, a typical personal deployment reclaims 200-400 GPU-hours that would have been wasted. That's 200 hours of continuous model improvement, for free, on hardware you already own.
Why Not Just Schedule Training?
You could run a cron job at 2 AM. Many people do. Here's why the idle-detection approach is better:
- It trains more. Cron gives you one window. Idle detection gives you every gap between requests — lunch breaks, meetings, sleep, weekends. The total reclaimed compute is 5-10× what a nightly cron provides.
- It never conflicts with inference. A cron job that starts at 2 AM will conflict with a 2:05 AM request. The idle trainer yields instantly. The user never notices.
- It adapts to your schedule. If you work late on Tuesdays, Tuesday gets less training. If you take Friday off, Friday gets a full day. No configuration needed — it observes and reacts.
- It's composable. The three-way rotation means specialist pretraining, agent specialization, and general adaptation all make progress. A cron job would need to pick one.
- It has built-in safety. Daily budgets, rollback on regression, VRAM checks, pause/resume — these aren't bolted on. They're the core architecture. A cron script would need all of this built from scratch.
The Numbers
After 30 days of operation on a personal RTX 5090 deployment with moderate usage (50-200 requests/day):
| Metric | Value |
|---|---|
| Total GPU hours reclaimed | 347 hours |
| Total training batches | 1,240 |
| Total gradient steps | 98,500 |
| Adapter versions (general) | 412 |
| Adapter rollbacks | 23 (1.9%) |
| Average loss improvement per week | -0.04 |
| Agent specialization sessions | 180 |
| Specialist pretraining progress | 78% |
| Night-mode batches | 620 (50%) |
| Inference latency impact | 0ms (measured) |
The last line is the important one. Zero measurable impact on inference latency. Training only runs when the GPU is idle. When it's not idle, training isn't running. The two workloads are temporally disjoint by construction.
The Source
The complete implementation:
| File | Lines | Purpose |
|---|---|---|
| `lib/training/idle_gpu_trainer.py` | ~980 | GPU state machine, idle detection, 3-way rotation |
| `lib/training/micro_trainer.py` | ~1000 | MicroTrainingDaemon: harvest → distill → train → hot-reload |
| `lib/training/agent_model_manager.py` | ~750 | Per-agent LoRA training + priority queue |
| `lib/training/experience_distiller.py` | ~830 | LLM-powered data quality amplification |
| `lib/training/specialist.py` | ~350 | From-scratch domain model (42M params) |
| `lib/core/SlotPool.py` | ~650 | AgentDispatcher + idle task scheduling |
| `services/training/AitherEvolutionCore.py` | ~2400 | Host service: 9 background loops + REST API |
No external training infrastructure. No cloud GPUs. No scheduled downtime. Just a state machine that watches a number, and a trainer that learns from its own conversations.
The idle GPU trainer is part of AitherOS's self-improvement stack, alongside continuous micro-training, Dark Factory autonomous operations, and Neuron evolution. Together, they turn a single consumer GPU into a system that doesn't just serve — it learns, adapts, and improves itself every hour it has nothing better to do.