AitherGraph: How We Unified 20 Graphs Into One System-Wide Brain
When you run 196 microservices across 65 Docker containers, debugging a cross-service issue means grepping through 15 different log files, mentally correlating trace IDs, and hoping someone documented which service depends on what. We got tired of this. So we built a system that makes the machine understand itself.
The Problem: 20 Graphs That Don't Talk to Each Other
AitherOS already had an impressive graph ecosystem: CodeGraph for AST indexing, ServiceGraph for topology, LogGraph for Chronicle entries, EventGraph for Flux events, InfraGraph for Docker containers, MemoryGraph for episodic memory -- 20+ faculty graphs in total, each excellent in its own domain. But they were islands. Asking "why is MicroScheduler slow?" required 10+ sequential HTTP calls across 6 different subsystems, each returning fragments with no cross-linking.
The ContextPipeline -- our 10-stage context assembly system that builds LLM prompts -- was the worst offender. Stage 3.5 queried FluxContextState. Stage 5.5 queried KnowledgeGraphBridge. Stage 4 queried Will/Spirit/Affect via 3 separate HTTP calls. A single chat message triggered a latency waterfall.
The Solution: AitherGraph
We built a unified in-process graph that subscribes to everything, indexes everything, and answers any cross-domain question in one call.
Design Principles
In-process, not a service. Sub-millisecond queries. No HTTP overhead. The graph lives inside Genesis (our orchestrator) as a singleton -- the same process that runs JarvisBrain, ContextPipeline, and the agent dispatch loop.
Normalized schema. Every domain maps to the same GraphNode and GraphEdge types. A CodeGraph function node and a ServiceGraph dependency edge use identical data structures. This is what enables cross-domain traversal -- you don't need to know which faculty graph the data came from.
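As a rough sketch, the normalized shape could look something like this (the field names here are illustrative, not the actual AitherGraph definitions):

# Illustrative sketch of a normalized node/edge schema; field names are assumptions.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GraphNode:
    id: str                      # e.g. "service:MicroScheduler" or "log:chronicle/184311"
    domain: str                  # "service", "code", "log", "event", "memory", ...
    label: str                   # human-readable name
    attrs: dict = field(default_factory=dict)   # domain-specific payload
    ts: Optional[float] = None   # ingestion timestamp, used for TTL eviction

@dataclass
class GraphEdge:
    src: str                     # source node id
    dst: str                     # destination node id
    kind: str                    # "depends_on", "mentions", "correlates", ...
    attrs: dict = field(default_factory=dict)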
Time-windowed ring buffers. Logs keep 10 minutes (2000 entries). Events keep 10 minutes. Services persist forever. Code persists until reindex. Each domain has its own buffer size and TTL, so the graph self-manages memory without manual eviction.
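A minimal sketch of that policy, assuming a bounded deque plus a TTL check on read (the real eviction logic may differ):

# Sketch of a time-windowed ring buffer: bounded size plus TTL-filtered reads.
# Constants mirror the log domain above (2000 entries / 10 minutes); the class
# itself is illustrative, not the AitherGraph implementation.
import time
from collections import deque
from typing import Optional

class DomainBuffer:
    def __init__(self, max_entries: int = 2000, ttl_seconds: Optional[float] = 600):
        self.entries = deque(maxlen=max_entries)  # oldest entries drop automatically
        self.ttl = ttl_seconds                    # None => persist forever (services, code)

    def add(self, item) -> None:
        self.entries.append((time.time(), item))

    def live(self) -> list:
        if self.ttl is None:
            return [item for _, item in self.entries]
        cutoff = time.time() - self.ttl
        return [item for ts, item in self.entries if ts >= cutoff]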
Cross-domain edges created automatically. When a log entry mentions service X, an edge links them. When a trace_id spans services A->B->C, the correlation is automatic. When a boot manifest references upstream dependencies, edges materialize to the service nodes.
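Building on the illustrative types above, the log-to-service rule could be as simple as a name match over known services (graph.add_edge here is an assumed method, not documented API):

# Illustrative auto-linking: when a log entry's message mentions a known
# service, record a "mentions" edge from the log node to that service node.
def link_log_to_services(graph, log_node, known_services) -> None:
    message = log_node.attrs.get("message", "")
    for name in known_services:
        if name in message:
            graph.add_edge(GraphEdge(src=log_node.id, dst=f"service:{name}", kind="mentions"))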
The Data Flow
Everything feeds in. Chronicle logs (warning and above) auto-flow via a bridge class. GraphSyncBus -- the existing pipeline that all 20+ faculty graphs use to sync to KnowledgeGraph -- now has a tap that mirrors every node into AitherGraph. FluxEmitter events are captured via subscription. LogManifest startup identity cards are auto-ingested at boot.
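Conceptually, the tap is just a second subscriber that mirrors every payload into the unified graph. A toy stand-in (GraphSyncBus's real subscribe API is not shown in this post):

# Toy stand-in for the tap: an extra subscriber on the sync bus mirrors every
# node payload into AitherGraph alongside the existing KnowledgeGraph sync.
class TinyBus:
    def __init__(self):
        self._subscribers = []
    def subscribe(self, fn):
        self._subscribers.append(fn)
    def publish(self, payload):
        for fn in self._subscribers:
            fn(payload)

mirrored = []
def aither_graph_tap(node_payload: dict) -> None:
    mirrored.append(node_payload)   # in AitherOS: upsert the normalized node into AitherGraph

bus = TinyBus()
bus.subscribe(aither_graph_tap)     # the tap rides alongside the KnowledgeGraph sync
bus.publish({"domain": "code", "id": "code:auth.verify_token"})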
Everything reads out. ContextPipeline queries it for every chat (stage 3.6). JarvisBrain refreshes it every 30 seconds. Genesis exposes it via 5 HTTP endpoints. MCP tools give agents direct access.
The Query API
One call does what used to take 10:
from lib.core.AitherGraph import get_graph
graph = get_graph()
# Universal search across all domains
result = graph.query(text="authentication", limit=10)
# Impact analysis: what depends on this service?
blast = graph.impact("MicroScheduler")
# Trace correlation: follow a request across services
chain = graph.correlate("trace_abc123")
# One-call context for any agent
ctx = graph.context_for("demiurge", "debug", "Why is auth slow?")
# System briefing (replaces 10+ HTTP calls)
briefing = graph.system_briefing()
The context_for() method replaced two ContextPipeline stages (Flux + Graph) with a single in-process call. It returns formatted, tagged context blocks ([SYSTEM], [SERVICES], [CODE], [ALERTS], [GRAPH_MEMORY]) ready to drop straight into the prompt.
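Schematically, each tag carries one slice of the fused answer (the descriptions below are illustrative, not verbatim output):

[SYSTEM]        high-level system summary: service counts, boot state, degraded subsystems
[SERVICES]      the services most relevant to the query, with their dependency edges
[CODE]          code nodes matched against the query text
[ALERTS]        recent warning-and-above log entries from the ring buffer
[GRAPH_MEMORY]  graph-backed memory hits related to the agent and task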
Phase 1: The Foundation -- LogManifest
Before we could unify anything, we needed services to identify themselves consistently. We built LogManifest -- a structured "identity card" that every service emits automatically at boot.
================================================================================
SERVICE MANIFEST: AitherMind
================================================================================
Port: 8088 | Layer: 3 (COGNITION) | Group: cognition | Boot Phase: 2
Critical: No | Container: aitheros-mind | Git: a1b2c3d
Description: Embedding engine and semantic search
--------------------------------------------------------------------------------
UPSTREAM DEPENDENCIES (2):
-> MicroScheduler :8150
-> WorkingMemory :8091
DOWNSTREAM DEPENDENTS (3):
<- Genesis, CognitionCore, LiveContext
--------------------------------------------------------------------------------
SUBSYSTEMS: 14 OK, 1 degraded [EstimatorClient: OPTIONAL]
================================================================================
This required zero per-service file changes -- it hooks into setup_lifecycle(), which all 154 services already call. The manifest emits to console (human-readable banner), JSONL (machine-readable), Flux (event bus), and now AitherGraph (the unified index).
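A minimal runnable sketch of the pattern (the real setup_lifecycle() and its sinks are more involved; the helper names below are illustrative):

# Sketch of the LogManifest hook: one call inside the shared lifecycle path
# fans the manifest out to every sink. Everything here is illustrative.
import json

def emit_manifest(manifest: dict) -> None:
    print(f"SERVICE MANIFEST: {manifest['name']}")   # console banner (abridged)
    print(json.dumps(manifest))                      # JSONL record
    # In AitherOS this also publishes a Flux event and ingests
    # the manifest into AitherGraph at boot.

def setup_lifecycle(service_name: str, port: int, upstream: list) -> None:
    # ... existing lifecycle wiring (health checks, signal handlers, ...) ...
    emit_manifest({"name": service_name, "port": port, "upstream": upstream})

setup_lifecycle("AitherMind", 8088, ["MicroScheduler", "WorkingMemory"])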
We also added inter_service_call() and inter_service_response() convenience methods to ChronicleLogger, auto-attaching W3C trace IDs for distributed tracing.
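A sketch of the W3C trace-context mechanics behind that (ChronicleLogger's actual method signatures are not reproduced here):

# Sketch of W3C trace propagation: generate a traceparent for an outbound call
# and reuse its trace_id on the receiving side, so log entries on both ends
# correlate and graph.correlate(trace_id) can later stitch the chain together.
import secrets

def new_traceparent() -> str:
    trace_id = secrets.token_hex(16)   # 32 hex chars, shared across the whole request chain
    span_id = secrets.token_hex(8)     # 16 hex chars, unique per hop
    return f"00-{trace_id}-{span_id}-01"

def trace_id_of(traceparent: str) -> str:
    return traceparent.split("-")[1]

headers = {"traceparent": new_traceparent()}   # the header the convenience methods attach
print(trace_id_of(headers["traceparent"]))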
Phase 2: Enforcement -- Log Quality Lint
Building infrastructure without enforcement is a hobby project. We built check_log_quality.py -- a static + runtime lint tool modeled after our existing check_import_fallbacks.py.
5 static checks scan every Python file: services missing startup manifests, untraced outbound HTTP calls, exception blocks that swallow errors, hardcoded localhost URLs, and bare print() in service files.
4 runtime checks analyze actual log output: JSONL schema validation, startup manifest coverage, trace continuity, and error rate anomalies.
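For flavor, one of the static checks (bare print() in service files) can be little more than an AST walk; this is a simplified stand-in, not check_log_quality.py's actual code:

# Simplified stand-in for one static check: flag bare print() calls in a file.
import ast
import sys

def find_bare_prints(path: str) -> list:
    source = open(path, encoding="utf-8").read()
    tree = ast.parse(source, filename=path)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "print"
    ]

if __name__ == "__main__":
    offenders = find_bare_prints(sys.argv[1])
    if offenders:
        print(f"bare print() at lines: {offenders}")
        sys.exit(1)   # non-zero exit is what turns the lint into enforcement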
First scan: 2,029 files, 3,109 baseline issues documented. Now we have a number to drive down.
The Unification That Made It Click
The breakthrough was not any single component -- it was wiring everything together in a single session:
- LogManifest gives services an identity, feeds into AitherGraph at boot
- Chronicle captures runtime behavior, warning+ logs flow into AitherGraph
- Faculty graphs (CodeGraph, ServiceGraph, etc.) capture deep domain analysis, mirrored via GraphSyncBus tap
- FluxEmitter captures system events, subscribed for real-time updates
- JarvisBrain refreshes the graph every 30 seconds, keeps Flux state current
- ContextPipeline queries the graph for every chat, agents get the most pertinent system information at the right time
- Genesis API exposes the graph, dashboard and MCP tools can query it
- MCP tools give agents direct access: system_graph_query, system_graph_impact, system_graph_correlate
The result: 242 services loaded from services.yaml, 138 dependency edges, and cross-domain queries that return in under a millisecond. An agent asking "why is MicroScheduler slow?" now gets a fused answer from service topology + recent error logs + Flux events + code context -- all from one in-process graph query.
What We Learned
Don't build another bridge -- build the destination. AitherOS had 20+ graphs, each with its own sync bus, query API, and cache. Adding graph #21 that queries the other 20 would just be another bridge. Instead, we made AitherGraph the place where all data converges, and tapped existing pipelines to feed it.
In-process beats inter-service for read-hot data. The ContextPipeline used to make 10+ HTTP calls per chat message. Now the graph query is a dictionary lookup. The latency difference is three orders of magnitude.
Enforcement scales better than documentation. We could have written a "Logging Best Practices" doc. Instead, we wrote a lint tool that exits non-zero. The 3,109 baseline violations are now a backlog, not a suggestion.
Identity is the foundation of observability. Without LogManifest -- without every service declaring "I am X, I depend on Y, I expose Z" -- the unified graph would just be a bag of anonymous nodes. The manifest makes every graph node attributable.
Numbers
| Metric | Value |
|---|---|
| Services in graph | 242 |
| Dependency edges | 138 |
| Faculty graphs feeding in | 20+ |
| ContextPipeline stages replaced | 2 (Flux + Graph -> single call) |
| Log quality baseline issues | 3,109 across 2,029 files |
| Tests | 42 (13 manifest + 29 graph) |
| New MCP tools | 5 |
| Genesis API endpoints | 5 |
| Lines of code | ~2,800 |
The system is live on develop. Every service that boots gets a manifest. Every warning gets indexed. Every faculty graph node gets mirrored. Every agent query hits one graph.
It took one session.