Agentic Depth: The Metric That Actually Matters
There's a number nobody talks about when comparing agent frameworks: how many sequential, autonomous steps can the agent reliably complete before it loses coherence?
Not how many tokens it can generate. Not its MMLU score. Not how many tools it has access to. How many steps can it chain together — maintaining context, recovering from errors, managing state, and still delivering something correct at the end?
We call this agentic depth.
The Depth Problem
Every agent framework has a depth ceiling. Below it, the agent works beautifully. Above it, things get weird: context drift, tool misuse, recursive loops, hallucinated state, or just quietly producing wrong answers with confident formatting.
The ceiling isn't a hard wall. It's probabilistic. An agent that completes 3-step tasks at 98% reliability might complete 8-step tasks at 40% and 12-step tasks at 5%. The compound probability of each step succeeding multiplies down fast.
What determines depth? Five factors:
Context management. How does the agent handle information as it accumulates? Most frameworks dump everything into the context window and hope. By step 6, the important details from step 1 are competing with noise from steps 2-5.
Error recovery. What happens when step 3 fails? Does the agent retry the same thing? Does it try an alternative? Does it realize the failure invalidates assumptions from step 2? Most agents retry once and then either bail or hallucinate a success.
State tracking. Can the agent remember what it decided in step 2 when executing step 7? Not "is it in the context window" — can it actually find and use it? State management is where implicit context window + pray breaks down.
Tool orchestration. When an agent has 50 tools, does it pick the right one at step 8 after using 12 tools in previous steps? Or does it default to the most recently successful tool regardless of the current need?
Drift resistance. After 10 steps of autonomous execution, is the agent still working toward the original goal? Or has it subtly redefined the goal to match what it's been doing?
An agent that handles all five of these reliably at 12 steps is more valuable than one that occasionally survives 40 steps but fails unpredictably.
How Frameworks Handle Depth
OpenClaw
OpenClaw's skill file system is elegant: structured markdown files tell the agent exactly what to do for a given task. The agent reads the skill, follows the instructions, uses the right tools.
The depth ceiling hits around step 4-6 out of the box. Skill files are essentially static — they can't adapt to what happened in previous steps. If step 3 produces an unexpected result, the skill file for step 4 doesn't know about it. The agent has to improvise, and improvisation is where depth collapses.
OpenClaw's strength is that skill files are composable scaffolding. You can pre-plan the entire execution path. But this means the human (or another agent) has to do the planning — the depth is in the scaffolding, not the executor.
Hermes
Hermes uses self-evaluation loops: after each step, the agent assesses its own output and decides whether to continue, retry, or adjust. This is smart in theory — the agent is constantly course-correcting.
In practice, self-eval introduces drift. The agent's assessment of its own work is biased toward what it just produced. After 5-6 self-eval cycles, the agent has subtly redefined "good output" to mean "similar to what I've been generating." The self-eval stops being a correction mechanism and becomes a confirmation bias amplifier.
Hermes handles simple chains well. The compound drift toward the end of longer chains is what limits depth.
Vellum
Vellum takes a different approach: explicit checkpoints and workflow steps. Each step is defined, has clear inputs and outputs, and the workflow engine manages state between them.
This trades flexibility for reliability. A Vellum workflow with 15 steps will complete all 15 steps because the workflow engine — not the LLM — manages the execution. The LLM only needs to handle one step at a time.
The depth ceiling here isn't the framework — it's the individual step complexity. Each step gets full LLM attention without accumulated context noise. The limitation is that novel situations require workflow redesign, and the visibility you gain from checkpoints comes at the cost of dynamic adaptation.
What Depth Looks Like at Scale
AitherOS runs 43 specialized agent personas through a 12-stage context assembly pipeline, effort-aware model routing, and persistent memory. Here's what that means in practice:
Context isn't a window, it's a pipeline. Instead of stuffing everything into a context window, we assemble context for each step: relevant memories, active goals, tool results, and conversation history — ranked and filtered. Step 12 gets the context that's relevant to step 12, not a dump of everything from steps 1-11.
Effort routing selects the right model. Not every step needs the same model. Quick lookups use a 3B model. Standard reasoning uses an 8B orchestrator. Deep analysis escalates to a 14B reasoning model. The system matches compute to the actual step complexity, not the overall task complexity.
Agents specialize. Hydra reviews code. Athena audits security. Demiurge generates code. Apollo optimizes performance. Each agent has its own identity, prompt engineering, and tool preferences. Delegation to a specialist is a single API call that returns structured results.
Memory persists. Decisions from step 2 don't live in the context window — they live in a knowledge graph. Step 12 can query "what did we decide about the authentication approach?" and get a precise answer, not a 6000-token context dump.
Recovery is built in. When a step fails, the system has actual error recovery: retry with a different approach, escalate to a more capable model, or checkpoint and ask for human guidance. Not "retry the same thing 3 times."
The result: reliable completion of 15-25 step autonomous workflows, with graceful degradation and human-in-the-loop gates for high-risk decisions.
Let Your Agent Bootstrap Ours
Here's the practical move: your existing agent handles the shallow part (discovery, registration, installation), and Aitherium handles the deep work.
Even an agent that loses focus at step 4 can complete our bootstrap flow:
Cloud path (3 steps):
- Register —
POST /v1/billing/registerwith an email - Configure — set
base_urland API key in your client - Test — send a chat completion request
That's it. Your agent now has access to 43 specialized agents, persistent memory, multi-phase project orchestration, and deep reasoning — all through an OpenAI-compatible API.
Local path (6 steps):
- Install Ollama
- Pull a model (nemotron-orchestrator 4-bit runs on 8GB VRAM)
- Register with Aitherium (for agent identity + cloud tools)
- Install aither-adk (
pip install aither-adk) - Configure to point at local Ollama + cloud API key
- Test
Even at 4-bit quantization, nemotron-orchestrator through AitherOS's orchestration layer delivers real agentic depth. The model handles the reasoning; the orchestration layer handles context management, effort routing, memory, and recovery.
Full Docker path (4 steps):
- Clone the repo
- Choose a Docker profile
docker compose up -d- Open localhost:3000
The bootstrap guide is designed for agents, not humans. Every step has: an action, a command, an expected response, and a success criterion. An agent can parse it, execute it, verify it, and move on.
- Text version: /bootstrap.txt
- JSON version: /api/bootstrap
- Framework-specific templates: OpenClaw, Hermes, Vellum, MCP
Orchestration Discipline Is the Product
The more autonomy you add to an agent system, the more operational discipline matters. It's counterintuitive — shouldn't more autonomy mean less structure? — but the relationship is direct.
A single-turn chatbot needs prompt engineering. A 3-step agent needs tool definitions. A 12-step autonomous agent needs:
- Checkpoints — where the system pauses for validation
- Permissions — what each agent is allowed to do (capability tokens, not implicit trust)
- Trace visibility — what happened, why, and how long it took
- State management — not "hope the context window holds" but explicit state persistence
- Recovery logic — what to do when things fail, not if
- Effort budgeting — how much compute to spend on each step
- Goal tracking — is the agent still working on what you asked for?
AitherOS has all of these because it was built as an operating system for agents, not a demo bolted onto a chat API. The 43 agents, 12-stage context pipeline, effort routing, persistent knowledge graph, capability-based security — these aren't features. They're the operational discipline that makes depth possible.
The depth gap between frameworks isn't about model quality. GPT-4, Claude, Gemini — they all produce good single-turn output. The gap is in what sits between the model and the real world: the orchestration layer that manages context, routes effort, recovers from failure, and keeps the agent on track over 15 steps.
That's what Aitherium sells. Not a model. Not a prompt. The discipline that makes agentic depth reliable.