Building a Self-Evolving Tokenizer: How AitherOS Teaches Itself to Speak
Most AI systems treat tokenization as a solved problem — download a pre-trained tokenizer, plug it in, ship it. But when you're building a self-hosted AI operating system with 43 specialized agents, 196 microservices, and a custom intent routing engine, "general purpose" isn't good enough anymore.
Today we shipped something we've been thinking about for months: a fully autonomous tokenizer evolution system that discovers its own vocabulary, retrains itself on schedule, and continuously improves intent classification — all without human intervention.
The Foundation: Months of Groundwork
This didn't come from nowhere. The infrastructure that makes this possible has been building for months:
11,580 labeled intent examples sitting in `Library/Training/intent_classification/sessions.jsonl` — every single IntentNanoGPT classification from production traffic, silently accumulating since March. Every time the system routed a request, it logged the query, the classified intent, and whether it was confirmed. That's 8,700+ usable training examples with 19 intent classes, ready to go.
648 JSONL training files (131MB+) across conversations, service telemetry, distilled examples, orchestrator training data, competition corpus, and agent-to-agent exchanges. This is the corpus the tokenizer trains on — not synthetic data, not web scrapes, but our actual operational data from months of running the system.
1,029 Python source files in lib/ alone — the codebase IS part of the training corpus. The tokenizer learns AitherOS vocabulary directly from the code that defines it.
SpecialistPretrainer already existed — a 42M-parameter GPT-2-style model trained from scratch — with its own BPE tokenizer (`specialist/tokenizer.py`), depth profiles, and training infrastructure. We just needed to make it evolve.
ClosedLoopController already orchestrated Harvest → Train → Deploy → Eval → Rollback for 6 evolution types. We added two more: TOKENIZER and INTENT_CLASSIFIER.
EvolutionEngine already captured every task completion as a feedback signal. We just wired intent outcomes into it.
SchedulerLoop already ran 323 autonomous routines on cron schedules. We added 4 more.
The point: this is what compounding infrastructure looks like. Every piece we built over the last 4 months — the training data pipeline, the feedback loops, the autonomous scheduler, the model versioning, the evaluation framework — suddenly composes into something that bootstraps itself.
The Problem: Death by a Thousand Subtokens
Here's what happens when you run AitherOS chat through a standard LLM tokenizer:
```
"forge_subagent demiurge codegraph_search"
→ ["for", "ge", "_sub", "agent", " dem", "i", "urge", " code", "graph", "_search"]
→ 10 tokens
```
With our trained tokenizer, each domain term maps to roughly one token — even a longer query with an intent label appended stays compact:

```
"forge_subagent demiurge codegraph_search intent:creation"
→ 11 tokens (with domain-specific BPE merges)
```
Every single domain term — agent names, tool names, intent labels, service identifiers — gets shattered into meaningless subwords by general-purpose tokenizers. The model spends its context window budget on reconstructing vocabulary it should already know.
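To make the effect concrete, here's a toy longest-match tokenizer in plain Python. It is not the real BPE implementation — just an illustration of how a single domain-vocabulary entry collapses a shattered identifier into one token:

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Toy longest-match tokenizer. Illustrates the effect of domain
    vocabulary only; the real system uses trained BPE merges."""
    tokens, i = [], 0
    while i < len(text):
        # take the longest vocab entry starting at position i
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character falls back to itself
            i += 1
    return tokens

generic = {"for", "ge", "_sub", "agent"}
domain = generic | {"forge_subagent"}

print(greedy_tokenize("forge_subagent", generic))  # ['for', 'ge', '_sub', 'agent']
print(greedy_tokenize("forge_subagent", domain))   # ['forge_subagent']
```

One vocabulary entry turns four subword fragments into a single token the model can attach meaning to.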
The Solution: Three Interlocking Systems
1. TokenizerEvolutionEngine — 455 Discovered Tokens
The engine discovers its vocabulary from the live system:
- Scans `config/identities/*.yaml` → 50 agent identity tokens
- Parses `config/services.yaml` → 196 service tokens (e.g., `<|svc:genesis|>`, `<|svc:strata|>`)
- Indexes `apps/AitherNode/tools/mcp/mcp_*.py` → 40+ tool domain tokens
- Reads IntentEngine's IntentType enum → 33 intent tokens
- Adds effort levels, depth signals, routing markers, cognitive boundaries
Total: 455 special tokens discovered from live system state. When a new agent is added or a new MCP tool ships, the tokenizer discovers it at the next drift check.
The drift detection runs every 4 hours:
```
drift_score = compression_drift * 0.4 + token_drift * 0.6
```
Three outcomes: stable (<8%), extend (8-15%), retrain (>15%). Current drift after first bootstrap: 57% (because the initial tokenizer was trained on Python source only — the full JSONL corpus retrain happens tonight at 4 AM).
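The scoring and three-way decision are simple enough to state exactly (a sketch using the weights and thresholds above; the real engine's function names may differ):

```python
def drift_action(compression_drift: float, token_drift: float) -> tuple[float, str]:
    """Weighted drift score and the three-way decision from the post.
    Inputs are fractions, e.g. 0.57 for 57% drift."""
    score = compression_drift * 0.4 + token_drift * 0.6
    if score < 0.08:
        return score, "stable"   # vocabulary still fits the corpus
    if score <= 0.15:
        return score, "extend"   # add new tokens, keep existing merges
    return score, "retrain"      # full BPE retrain
```

Weighting token drift higher than compression drift means new vocabulary appearing in the corpus triggers action sooner than a gradual efficiency decline.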
2. Intent Classifier — 14M Param Neural Router
The existing IntentEngine uses pattern matching plus a call out to a 1B-parameter model. Now we're training a dedicated 14M-parameter transformer just for intent classification:
- Architecture: 6 layers, 384 hidden dim, 6 heads, bidirectional attention, classification head
- Training data: 8,708 labeled examples from production (19 classes active)
- Target: >85% accuracy, >80% F1 macro
- Inference: Sub-millisecond on CPU
- Deployment: Shadow mode first (A/B vs IntentEngine), then promotion
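A model with those dimensions is a minimal sketch in PyTorch. The layer count, hidden size, head count, and class count match the bullets above; vocab size, feed-forward width, and mean pooling are assumptions, not the real config:

```python
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    """Sketch matching the stated dims: 6 layers, 384 hidden, 6 heads,
    bidirectional (encoder) attention, classification head. Vocab size,
    FFN width, and pooling strategy are assumptions."""
    def __init__(self, vocab_size: int = 8192, n_classes: int = 19):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 384)
        layer = nn.TransformerEncoderLayer(
            d_model=384, nhead=6, dim_feedforward=1536, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(384, n_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.embed(token_ids))  # (batch, seq, 384)
        return self.head(h.mean(dim=1))          # mean-pool, then classify

model = IntentClassifier()
logits = model(torch.randint(0, 8192, (2, 16)))  # batch of 2, seq len 16
print(logits.shape)  # torch.Size([2, 19])
```

With these assumed widths the parameter count lands at roughly 14M, which is why CPU inference can stay sub-millisecond for short queries.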
The self-improvement loop is the key:
- EvolutionEngine.on_task_complete() feeds back whether the routed intent was correct
- Successful tasks confirm the intent label (positive training signal)
- Failed tasks with known corrections become hard negatives
- Class imbalance triggers synthetic augmentation (3x for underrepresented classes)
- Daily retraining incorporates all new data automatically
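The loop above can be sketched as a small feedback buffer. The method name mirrors `EvolutionEngine.on_task_complete()`; the row fields, weights, and oversampling rule are illustrative assumptions, not the real API:

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class IntentFeedbackBuffer:
    """Sketch of turning task outcomes into training rows."""
    rows: list = field(default_factory=list)

    def on_task_complete(self, query: str, predicted: str,
                         success: bool, correction: Optional[str] = None):
        if success:
            # confirmed label: positive training signal
            self.rows.append({"text": query, "label": predicted, "weight": 1.0})
        elif correction is not None:
            # failed route with a known fix: hard example, upweighted
            self.rows.append({"text": query, "label": correction, "weight": 2.0})

    def augmented(self, factor: int = 3, floor: float = 0.2) -> list:
        """Naive oversampling for classes far below the largest class."""
        counts = Counter(r["label"] for r in self.rows)
        top = max(counts.values())
        out = list(self.rows)
        for r in self.rows:
            if counts[r["label"]] < top * floor:
                out.extend([dict(r)] * (factor - 1))
        return out
```

The daily 5 AM job would drain this buffer, apply the augmentation pass, and retrain on the combined confirmed-plus-corrected set.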
3. LoRA Special Token Injection — Teaching Production Models Our Language
For the production Nemotron-8B orchestrator model, we inject 14 critical tokens:
```python
# Automatically injected during UnslothTrainer.load_model()
["<|tool_call|>", "<|tool_result|>", "<|tool_error|>",
 "<|agent_start|>", "<|agent_end|>", "<|agent_handoff|>",
 "<|intent|>", "<|route|>",
 "<|think|>", "<|/think|>", "<|plan|>", "<|/plan|>",
 "<|confidence:low|>", "<|confidence:high|>"]
```
The embedding layer resizes by 14 tokens. LoRA training teaches the model what they mean. Zero manual intervention — happens automatically on every training run.
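The injection step itself is small. Assuming a Hugging Face-style tokenizer/model API (which Unsloth wraps), it amounts to:

```python
def inject_special_tokens(tokenizer, model, tokens: list) -> int:
    """Add special tokens, then grow the embedding matrix to match.
    Assumes Hugging Face-style add_special_tokens /
    resize_token_embeddings methods; wrapper details may differ."""
    added = tokenizer.add_special_tokens({"additional_special_tokens": tokens})
    if added:
        # embedding rows for the new ids start newly initialized;
        # LoRA training then teaches the model what they mean
        model.resize_token_embeddings(len(tokenizer))
    return added
```

Returning the count of newly added tokens makes the call idempotent: rerunning it on an already-injected tokenizer adds nothing and skips the resize.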
The Architecture: No Training in Genesis
A critical design decision: Genesis never runs training workloads. It's a pure API gateway.
```
User → Genesis (dispatch) → Worker (execute)
            ↓                       ↓
   /tokenizer/drift        torch + GPU + training
   /tokenizer/retrain      intent_classifier.train()
   /tokenizer/status       model.save() → Library/
```
- Genesis (port 8001): Lightweight FastAPI router. Dispatches to Worker. No torch.
- Worker (port 8159): Has torch 2.11, runs SchedulerLoop, executes training. Bind-mounted Library for persistence.
- Tokenizer BPE: Uses the `tokenizers` library (Rust, no GPU) — fine to run anywhere.
- Intent Classifier: Uses `torch.nn.TransformerEncoderLayer` — MUST run on Worker.
The Schedule: Fully Autonomous
| Routine | Cron | Agent | What it does |
|---|---|---|---|
| `TOKENIZER_drift_check` | Every 4h | Viviane | Measure corpus drift, recommend action |
| `TOKENIZER_retrain` | Daily 4 AM | Viviane | Full BPE retrain if drift > 15% |
| `INTENT_CLASSIFIER_train` | Daily 5 AM | Hydra | Train classifier, evaluate, promote |
| `TOKENIZER_weekly_prune` | Sunday 3 AM | Viviane | Clean old versions, full evolution tick |
All 4 routines are registered in routines.yaml (323 total routines in the scheduler). They run autonomously via the SchedulerLoop on Worker. Viviane owns corpus/knowledge; Hydra owns quality gates.
Observability: Everything Flows to Strata
Every action emits to Strata:
- `tokenizer_drift` — drift score, compression ratio, recommendation
- `tokenizer_evolution` — version trained, promoted/rolled-back
- `intent_classifier_training` — accuracy, F1, per-class metrics
- `tokenizer.retrained` — Flux event for system awareness
The dashboard at /tokenizer/status shows:
```json
{
  "tokenizer_evolution": {
    "active_version": "v_20260420_225819",
    "versions_trained": 3,
    "total_discovered_tokens": 455,
    "special_token_count": 136,
    "avg_compression_ratio": 4.2
  },
  "intent_classifier": {
    "total_examples": 8708,
    "classes": 19,
    "ready_for_training": true,
    "shadow_mode": false
  }
}
```
The Feedback Loop: Why This Gets Better Over Time
```
┌─────────────────────────────────┐
│ User sends message              │
└─────────────┬───────────────────┘
              │
┌─────────────▼───────────────────┐
│ IntentEngine classifies         │
│ Shadow: IntentClassifier also   │
│ classifies (A/B comparison)     │
└─────────────┬───────────────────┘
              │
┌─────────────▼───────────────────┐
│ Task executes (agent, tools)    │
└─────────────┬───────────────────┘
              │
┌─────────────▼───────────────────┐
│ EvolutionEngine.on_task_complete│
│ → success? → confirm intent     │
│ → failure? → record correction  │
└─────────────┬───────────────────┘
              │
┌─────────────▼───────────────────┐
│ Training data accumulates       │
│ (confirmed + corrections)       │
└─────────────┬───────────────────┘
              │
┌─────────────▼───────────────────┐
│ 5 AM: IntentClassifier retrains │
│ Evaluates → promotes if better  │
└─────────────────────────────────┘
```
Every interaction makes the next one better. The system learns from its own routing decisions.
What Compounding Infrastructure Looks Like
None of this would have been possible without:
- Strata (built month 1): Where all telemetry lands. Without it, no training data.
- SchedulerLoop (built month 2): Where routines execute. Without it, no automation.
- EvolutionEngine (built month 2): Where task outcomes feed back. Without it, no improvement signal.
- ClosedLoopController (built month 3): Where training pipelines orchestrate. Without it, no deploy/rollback cycle.
- IntentNanoGPT (built month 2): Logged 11,580 intent examples over 2 months. Without it, no training data.
- UnslothTrainer (built month 3): Where LoRA fine-tuning happens. Without it, no production model improvement.
- Worker (built month 3): Where heavy workloads execute. Without it, Genesis would be crushed.
- Library bind mount (day 1): Where all training artifacts persist. Without it, restarts lose everything.
Each piece seemed like a small win when built. Together, they compose into a system that bootstraps its own intelligence.
That's the moat. Not any single clever algorithm — the accumulation of correctly-wired infrastructure over time.
Built with AitherOS on an RTX 5090. 455 discovered tokens. 33 intent classes. 8,708 training examples. 323 autonomous routines. Zero human in the loop.