What a 2012 Sculpture Recognition Paper Taught Us About Context Assembly and Tree Search
Sometimes the fix you need is hiding in a paper about something you'd never build.
Relja Arandjelović and Andrew Zisserman — the Oxford VGG group — published a paper in 2012 called "Name That Sculpture." The problem: given a photo of a sculpture, identify it. The domain is oddly specific. We will never build a sculpture recognition system. But the paper contains a general pattern that turned out to solve two problems we'd been fighting for months.
Here's what they showed: if you have two retrieval methods that are good at different things, fusing their results beats picking either one alone. Even when one method looks terrible in isolation.
Their two methods:
- SIFT — texture-based features. Great for carved inscriptions, ornate detail, rough stone. Terrible for smooth marble.
- Bag-of-Boundaries — contour features. Great for smooth curved surfaces, silhouettes. Terrible for textured surfaces.
Neither method worked alone. SIFT failed on every smooth statue. Bag-of-Boundaries failed on every textured relief. But when they fused the results — simple max-score: for each candidate, take whichever method gave the higher score — accuracy jumped across all sculpture types.
The key word is complementary. Not "two methods that agree." Two methods that cover each other's blind spots.
We have exactly this problem. Three times.
Problem 1: The ContextPipeline Ranking Problem
AitherOS assembles context for every LLM call through a 12-stage pipeline. By the time the system reaches Stage 7 — re-scoring — it has chunks from five or six different sources running in parallel:
| Source | What it retrieves | Score range |
|---|---|---|
| CodeGraph | Matching functions, classes, call chains | 0.50 – 0.85 |
| Spirit | Long-term user memory, past conversations | 0.30 – 0.70 |
| KnowledgeGraph | Entity relationships, structural data | 0.20 – 0.50 |
| Neurons | Web search, live data, system telemetry | 0.10 – 0.90 |
| MemoryBus | Working memory, recent context | 0.60 – 0.90 |
The old pipeline did the obvious thing: sort all chunks by their raw relevance score, highest first.
This is wrong. A CodeGraph chunk scoring 0.70 and a KnowledgeGraph chunk scoring 0.70 are not equally relevant. CodeGraph's scores cluster between 0.50 and 0.85 — 0.70 means "slightly above average." KnowledgeGraph's scores cluster between 0.20 and 0.50 — if something scored 0.70, it's a perfect match. But naive sort treats them identically.
Worse: MemoryBus scores are inherently high because working memory is recent and always somewhat relevant. Every query shoves the last five minutes of conversation into the top of the context, whether or not it's actually useful for this particular question.
The result: the system consistently over-weighted working memory and under-weighted graph data. If you asked "how does agent routing work?" the top five chunks were always your recent Docker debugging session, not the architectural documentation that actually answered the question.
The Sculpture Fix
The paper's pattern, generalised:

- Normalise per source. Don't compare raw scores across sources. Normalise each source's scores to [0, 1] independently. Now a "best match from KnowledgeGraph" and a "best match from CodeGraph" are both 1.0, regardless of their original scale.
- Fuse, don't pick. Don't choose the best source per query. Fuse all sources and let the best result from the best source for this specific question surface naturally.
- Adapt to the query type. The paper's core insight: different inputs need different methods. "Textured" inputs (rich, semantic queries) carry strong signal for embedding-based retrieval. "Smooth" inputs (sparse, structural queries) carry better signal for graph/keyword retrieval.
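The first step, per-source normalisation, fits in a few lines. This is an illustrative sketch, not the production FusedRetrieval code; the chunk shape and field names are assumptions.

```python
def normalize_per_source(chunks):
    """Min-max normalise scores within each source, so the best chunk
    from every source maps to 1.0 regardless of that source's raw
    score range. Illustrative sketch; field names are assumptions."""
    by_source = {}
    for chunk in chunks:
        by_source.setdefault(chunk["source"], []).append(chunk)
    out = []
    for group in by_source.values():
        lo = min(c["score"] for c in group)
        hi = max(c["score"] for c in group)
        span = (hi - lo) or 1.0  # a single-chunk source normalises to 0.0
        for c in group:
            out.append({**c, "norm": (c["score"] - lo) / span})
    return out

chunks = [
    {"source": "codegraph", "score": 0.70},  # mid-pack for CodeGraph
    {"source": "codegraph", "score": 0.85},
    {"source": "codegraph", "score": 0.50},
    {"source": "graph", "score": 0.70},      # best match KnowledgeGraph produced
    {"source": "graph", "score": 0.20},
]
normed = normalize_per_source(chunks)
```

After normalisation the two 0.70 chunks diverge: CodeGraph's maps to roughly 0.57 (slightly above average for that source) while KnowledgeGraph's maps to 1.0 (a perfect match for that source), which is exactly the distinction naive sorting misses.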
We built this as FusedRetrieval — a general engine with four strategies:
```
query → estimate_query_richness() → [0.0 sparse ←→ 1.0 rich]

richness < 0.3 → max-score fusion  (let the single best source win)
richness > 0.6 → weighted-sum      (reward multi-source consensus)
0.3 – 0.6      → interpolate       (blend both strategies)
```
The estimate_query_richness() function is the generalised "textured vs smooth" discriminator. It looks at query length, vocabulary diversity, and the ratio of semantic signal words ("explain", "architecture", "design") to structural signal words ("list", "find", "which", "depends on"). Short, structural queries get low richness. Long, conceptual queries get high richness.
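A minimal sketch of that discriminator plus the strategy dispatch. The 40/20/40 weighting is from the shipped heuristic; the word lists, the 10-word length cap, and the use of a plain mean as the consensus score are assumptions for illustration.

```python
SEMANTIC_WORDS = {"explain", "architecture", "design", "why", "how"}
STRUCTURAL_WORDS = {"list", "find", "which", "depends", "what"}

def estimate_query_richness(query: str) -> float:
    """Illustrative richness estimator: 40% length, 20% vocabulary
    diversity, 40% semantic-vs-structural signal-word ratio."""
    words = query.lower().split()
    if not words:
        return 0.0
    length = min(len(words) / 10.0, 1.0)          # saturate at 10 words
    diversity = len(set(words)) / len(words)
    semantic = sum(w in SEMANTIC_WORDS for w in words)
    structural = sum(w in STRUCTURAL_WORDS for w in words)
    ratio = semantic / (semantic + structural) if semantic + structural else 0.5
    return 0.4 * length + 0.2 * diversity + 0.4 * ratio

def fuse_scores(per_source: dict, richness: float) -> float:
    """Fuse one candidate's per-source-normalised scores: max-score
    below 0.3, consensus (a simple mean here) above 0.6, and a linear
    interpolation between the two strategies in the middle band."""
    best = max(per_source.values())
    consensus = sum(per_source.values()) / len(per_source)
    if richness < 0.3:
        return best
    if richness > 0.6:
        return consensus
    t = (richness - 0.3) / 0.3
    return (1 - t) * best + t * consensus
```

With this sketch a sparse query leaves a single-source hit untouched, while a rich query pulls multi-source candidates toward their cross-source average.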
Here's what this looks like on real queries:
| Query | Richness | Strategy |
|---|---|---|
| Docker | 0.12 | max-score (let the one best source dominate) |
| list all agents | 0.26 | max-score |
| what ports does Genesis use | 0.50 | interpolated blend |
| how does the agent routing architecture work | 0.74 | weighted-sum (reward cross-source agreement) |
| explain the design pattern behind effort-based model selection | 0.98 | weighted-sum |
This matters because different sources excel at different query types:
- "list all agents" → CodeGraph and ServiceGraph have this. Spirit doesn't. Max-score lets the graph result dominate.
- "explain effort-based scaling" → Spirit remembers a past conversation where this was discussed. KnowledgeGraph has the structural relationship. CodeGraph has the EffortScaler source. All three contribute partial answers. Weighted-sum promotes results that appear across multiple sources.
This is the exact same dynamic as SIFT vs Bag-of-Boundaries. CodeGraph is like SIFT — great for precise, textured queries. Spirit is like Bag-of-Boundaries — great for smooth, conceptual queries. Fusing them beats either alone.
Before and After
Query: "how does agent routing work in the A2A architecture"
Before (naive sort by raw score):
```
#1  0.720  codegraph  Agent routing: AgentForge dispatches via capability match
#2  0.700  graph      KnowledgeGraph: AgentForge→EffortScaler→UnifiedChatBackend
#3  0.680  codegraph  Service mesh: Genesis→Council→Agent via A2A protocol
#4  0.650  spirit     User previously asked about agent communication patterns
#5  0.600  spirit     Session context: working on A2A gateway integration
```
After (fused source-aware re-scoring):
```
#1  0.397  spirit     User previously asked about agent communication patterns
#2  0.382  codegraph  Agent routing: AgentForge dispatches via capability match
#3  0.374  graph      KnowledgeGraph: AgentForge→EffortScaler→UnifiedChatBackend
#4  0.344  codegraph  Service mesh: Genesis→Council→Agent via A2A protocol
#5  0.337  neurons    Web search: A2A protocol specification by Google
```
The Spirit chunk rose from #4 to #1. It should be #1 — when the user has asked about something before, that context is extremely high-value for a semantic question. But its raw score was 0.650 while CodeGraph's was 0.720, so naive sort buried it.
After normalisation within each source, Spirit's 0.650 becomes a high score for Spirit, and the adaptive weighting for this rich query (richness 0.78) upweights semantic sources like Spirit.
The Neurons result (web search for the A2A specification) also climbed from #8 to #5. Its raw score was 0.500, but after per-source normalisation it was the top Neurons result, and multi-source consensus promoted it.
Integration
The fusion engine plugs into two places:
- ContextPipeline Stage 7.5 — after all sources have injected their chunks but before Stage 8 surgical eviction. The existing Jaccard re-scoring runs first, then `rescore_chunks_fused()` applies per-source normalisation and adaptive fusion. The blending formula is 60% fused score + 40% original — enough to shift rankings meaningfully without erasing ingestion-time signals entirely.
- AitherMemoryBus.query() — the cross-tier memory query. Four tiers (workingmemory, spirit, mind, graph) run in parallel; their results are now fused instead of concatenated. WorkingMemory no longer automatically dominates by virtue of having inherently higher scores.
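The 60/40 blend is plain linear interpolation. A one-function sketch (the helper name is hypothetical; the same shape covers the 70/30 split used elsewhere):

```python
def blend_scores(fused: float, original: float, fused_weight: float = 0.6) -> float:
    """Linearly blend the fused score with the original relevance
    score, so fusion shifts rankings without erasing ingestion-time
    signals. Default is the ContextPipeline's 60/40 split."""
    return fused_weight * fused + (1.0 - fused_weight) * original
```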
Problem 2: The MCTS Evaluation Problem
AitherMCTS is our Monte Carlo Tree Search engine for cognitive planning. When the system needs to reason about a complex question, MCTS explores a tree of retrieval actions:
```
Root: "how does agent routing work?"
├── entity_lookup:AgentForge (KnowledgeGraph)
│   ├── neighbor_expand:AgentForge (graph traversal)
│   └── entity_lookup:EffortScaler
├── faculty_graph_lookup:codegraph (CodeGraph)
│   └── entity_lookup:ChatEngine
├── vector_search:"agent routing" (Nexus documents)
├── community_zoom:agents (WikiGraph community)
└── web_search:"A2A protocol" (AitherSearch)
```
Each branch is a different retrieval action across a different backend. The tree expands by querying KnowledgeGraph, the seven faculty graphs (code, docs, services, config, API, infra, scripts), Nexus document search, WikiGraph community summaries, and live web search through AitherSearch.
MCTS needs two things to work:
- Priors — before visiting a branch, how promising does it look? This determines exploration order.
- Evaluation — after visiting a branch, what did we learn? This determines backpropagation and future visits.
The current implementation computes both using FusedRetrieval — the same engine that powers the ContextPipeline:
For child priors — _fuse_expansion_priors() groups all candidate branches by action type (source) and feeds them through rank-normalised adaptive fusion. Children from sources that are complementary to the query type get boosted; children from weak sources get suppressed. The formula blends 70% fused score with 30% original prior to prevent complete overwrite of retrieval-time signals.
For leaf evaluation — when a node has evidence from two or more source types, _fused_evaluate() normalises evidence scores per source and fuses them adaptively. Rich queries boost the value of branches that found documents and community context. Sparse queries boost the value of branches that found precise graph matches. When only a single source is present, the legacy evaluator fires as fallback.
The fusion_strategy parameter on SearchRequest controls which strategy is used (adaptive, max_score, rrf, weighted_sum), and the chosen strategy is returned in the response for observability.
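The two-or-more-sources gate in leaf evaluation can be sketched as follows. The evidence shape, the fallback signature, and the max-to-consensus interpolation are assumptions, not the production `_fused_evaluate()` internals.

```python
def fused_evaluate(evidence, richness, legacy_evaluate):
    """Evaluate a leaf from multi-source evidence. With fewer than
    two source types, defer to the legacy evaluator; otherwise take
    each source's best score and slide between max (sparse queries)
    and consensus (rich queries). Illustrative sketch only."""
    best_per_source = {}
    for item in evidence:
        src, score = item["source"], item["score"]
        best_per_source[src] = max(best_per_source.get(src, 0.0), score)
    if len(best_per_source) < 2:
        return legacy_evaluate(evidence)
    top = max(best_per_source.values())
    consensus = sum(best_per_source.values()) / len(best_per_source)
    return (1.0 - richness) * top + richness * consensus
```

The gate keeps single-source branches on the old code path, so fusion only changes behaviour where there is genuinely something to fuse.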
This is the same pattern: graph traversal is like SIFT (precise, structural), document/web search is like Bag-of-Boundaries (broad, semantic). Fusing them adaptively beats either alone — and beats fixed caps.
```
User Query
 │
 ├──→ ContextPipeline (Stages 1-12)
 │     └── Stage 7.5: FusedRetrieval re-scoring ✓
 │
 ├──→ AitherMemoryBus.query()
 │     └── Tier fusion via FusedRetrieval ✓
 │
 └──→ AitherMCTS /mcts/search
       ├── _expand_node() → queries 5 backends in parallel
       ├── _fuse_expansion_priors() → rank-normalised fusion ✓
       └── _fused_evaluate() → adaptive multi-source evaluation ✓
```
Problem 3: The Simulation Event Problem
AitherPrometheus is our tick-based simulation engine. It runs civilizations — NPCs, weather, combat, economy, quests — through a 39-manager tick loop. Every tick, Prometheus picks procedural generators to produce events, retrieves NPC memories for dialogue context, and evaluates sentiment for character interactions.
All three of these had the same structural flaw: using a single source where multiple complementary sources existed.
NPC Memory Recall
When an NPC needs to remember something (for dialogue, for decision-making), Prometheus queries two sources:
- Nexus semantic search — finds memories by meaning. "What does Aldric remember about the king?" retrieves semantically relevant memories regardless of when they happened.
- In-memory episodic stream — the NPC's recent event log. Always fresh, always available, but unsorted by relevance.
The old code used Nexus-or-fallback: try semantic search, and if it fails or returns nothing, fall back to the raw stream. This is the binary choice the sculpture paper warns against.
The problem: Nexus results are relevance-ranked but miss recent events (indexing lag). Stream memories are chronologically ordered but miss semantic matches. A question like "how does Aldric feel about the treaty?" needs both — the relevant old conversation about the treaty and the recent event where the treaty was mentioned.
fuse_npc_memories() applies the same pattern: Nexus results are scored by semantic relevance; stream memories are scored by recency (exponential decay from current day). FusedRetrieval normalises both and fuses adaptively.
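The recency half of that scoring is a simple exponential decay from the current simulation day, which makes stream memories comparable with Nexus relevance scores before fusion. A sketch; the half-life constant and memory shape are assumptions:

```python
import math

def score_stream_memories(memories, current_day, half_life_days=3.0):
    """Score episodic-stream memories by exponential recency decay:
    a memory from today scores 1.0, one from half_life_days ago
    scores 0.5. The decay constant is an illustrative assumption."""
    decay = math.log(2) / half_life_days
    return [
        {**m, "score": math.exp(-decay * (current_day - m["day"]))}
        for m in memories
    ]
```

Both lists then go through the same per-source normalise-and-fuse path as any other pair of complementary sources.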
Generator Selection
Every simulation tick picks 3–5 procedural generators from a registry of 120+ types (weather, combat, dialogue, loot, quests, music, etc.). The old code used random.sample() — pure uniform random.
This produces incoherent worlds. A thunderstorm tick followed by a bard music tick followed by a loot drop tick has no narrative throughline. The generators have categories and labels that carry context — but random selection ignores them entirely.
Two complementary signals exist:
- Keyword relevance — does the generator's category/label match the current world context? (Structural signal, like SIFT.)
- Category diversity — has this category been selected recently? Categories with fewer recent picks get a variety boost. (Coverage signal, like Bag-of-Boundaries.)
fuse_generator_selection() feeds both into FusedRetrieval. The query is the current world context (weather state, time of day, active events). Adaptive weighting means: when the world context is rich and specific ("a dark stormy night with wolves howling"), keyword relevance dominates and weather/combat generators surface. When the context is sparse ("day 47"), variety dominates and the system explores diverse generator types.
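The two signals can be sketched like this. The generator shape, field names, and the variety-boost constant are all illustrative, not the Prometheus internals:

```python
from collections import Counter

def score_generators(generators, context_words, recent_categories, variety_boost=0.5):
    """Attach the two complementary signals to each generator:
    keyword overlap with the current world context (structural),
    and a variety boost for categories picked rarely in recent
    ticks (coverage). Shapes and constants are illustrative."""
    recent = Counter(recent_categories)
    max_recent = max(recent.values(), default=0) or 1
    scored = []
    for g in generators:
        labels = set(g["labels"]) | {g["category"]}
        keyword = len(labels & context_words) / max(len(context_words), 1)
        variety = variety_boost * (1.0 - recent[g["category"]] / max_recent)
        scored.append({**g, "keyword": keyword, "variety": variety})
    return scored

generators = [
    {"name": "thunderstorm", "category": "weather", "labels": ["storm", "rain"]},
    {"name": "bard_song", "category": "music", "labels": ["song", "tavern"]},
]
scored = score_generators(
    generators,
    context_words={"storm", "night", "wolves"},
    recent_categories=["weather", "weather", "music"],
)
```

Both signals then feed the fusion engine, which decides per tick whether keyword relevance or variety should dominate.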
Vibe Check Sentiment
The vibe check endpoint trains two competing NanoGPT micro-models — one on positive text, one on negative — and compares their anomaly scores. Lower anomaly = the model is less "surprised" by the input.
The old code did `sentiment = "positive" if pos_score < neg_score else "negative"`. Raw comparison, no normalisation, no adaptation.
But the two models have complementary strengths that map directly to the sculpture paper's insight:
- Positive model — trained on expressive, enthusiastic text. Less surprised by rich, emotional language. (Textured = SIFT.)
- Negative model — trained on blunt, critical text. Less surprised by sparse, direct language. (Smooth = Bag-of-Boundaries.)
fuse_vibe_scores() uses estimate_query_richness() from FusedRetrieval to weight between them. Rich, emotional text ("I absolutely love how this turned out, it's magnificent!") upweights the positive model's signal. Sparse, factual text ("bad") upweights the negative model's signal. The confidence delta reflects this weighting.
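One way to sketch that weighting; the formula below is an illustrative stand-in, not the shipped `fuse_vibe_scores()` implementation:

```python
def fuse_vibe_scores(pos_anomaly, neg_anomaly, richness):
    """Weight the two models' inverted anomaly scores by text
    richness before comparing: rich, emotional text trusts the
    positive model more, sparse text the negative one. The exact
    weighting scheme here is an illustrative assumption."""
    # lower anomaly = less surprised, so invert before weighting
    pos_signal = (1.0 - pos_anomaly) * richness
    neg_signal = (1.0 - neg_anomaly) * (1.0 - richness)
    sentiment = "positive" if pos_signal >= neg_signal else "negative"
    confidence = abs(pos_signal - neg_signal)
    return sentiment, confidence
```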
The General Principle
The sculpture paper is really about a meta-pattern:
If your system retrieves from multiple sources and those sources have complementary strengths, fusing their results will beat any fixed selection or weighting.
This applies anywhere you have heterogeneous retrieval:
- Context assembly (ContextPipeline) ✓
- Memory recall (MemoryBus) ✓
- Planning (MCTS priors + leaf evaluation) ✓
- Simulation event selection (Prometheus generator picking) ✓
- NPC memory recall (Prometheus Nexus + episodic stream) ✓
- Sentiment analysis (Prometheus dual-model vibe check) ✓
- Agent selection (AgentForge uses keyword scoring today — could use fusion)
- Tool selection (neurons use regex + micro-transformer — consumption feedback could create a third signal to fuse)
The paper's other insight — that the optimal fusion depends on the input type — is even more general. We turned "textured vs smooth" into "rich vs sparse" query classification. But the same logic applies to any pair of complementary dimensions. Fast vs slow. Precise vs exploratory. Recent vs historical. Expressive vs blunt.
Every time you catch yourself writing `if query_looks_structural: use_graph() else: use_vector()`, you're making a hard binary choice where fusion would work better. Don't pick. Fuse. Let the scores talk.
Numbers
| Component | Metric |
|---|---|
| FusedRetrieval | 821 lines, 4 strategies, 3 normalisation methods |
| AitherMCTS | 840+ lines, 6 retrieval backends, 7 faculty graphs, fused priors + evaluation |
| AitherPrometheus | 2,680+ lines, 3 fusion points (memories, generators, vibe check) |
| New test coverage | 53 FusedRetrieval + 8 MCTS fusion + 10 Prometheus fusion (all passing) |
| Existing tests | 99 ContextPipeline + 69 MemoryBus (all passing) |
| Query richness estimation | 40% length, 20% vocabulary diversity, 40% signal word ratio |
| Score blending | 60% fused + 40% original relevance (ContextPipeline); 70% fused + 30% original (MCTS priors) |
| MCTS backends | KnowledgeGraph, 7 faculty graphs, Nexus docs, Nexus communities, AitherSearch |
| Prometheus fusion points | NPC memory recall, generator selection, vibe check sentiment |
What's Next
The immediate wiring is done. FusedRetrieval now powers four systems: ContextPipeline, MemoryBus, MCTS, and Prometheus, where it drives three subsystems (NPC memory recall, generator selection, vibe check). The pattern keeps applying.
Next frontiers:
Backpropagation of consumption data. Just like the neuron training system learns which neurons produce consumed context, MCTS can learn which branches produce evidence that actually appears in the final response. That consumption signal becomes a third fusion source alongside score and rank. Prometheus can do the same — generators whose output gets referenced in dialogue or player actions should be weighted higher.
Agent selection fusion. AgentForge currently uses keyword-based capability matching to dispatch tasks to agents. The agent with the best keyword overlap wins. But we have multiple signals: keyword match, historical success rate per agent, effort-level suitability, and live load. Fusing these would let the dispatcher adapt — a heavily loaded agent with perfect keyword match should sometimes lose to a lightly loaded agent with good-enough match.
Cross-tick narrative coherence. Prometheus's generator fusion currently operates per-tick. But narrative quality depends on sequences of events, not individual events. A fusion layer that scores generator candidates against the last N events (semantic similarity to maintain theme + diversity to prevent repetition) would produce more coherent stories.
The sculpture paper has one more lesson we haven't used yet: their Flickr-based voting system, where photos with matching geolocation tags provided a third, independent signal. We have the equivalent — user reactions, follow-up questions, conversation flow. That's a fusion source waiting to be wired in.
FusedRetrieval is available in lib/core/FusedRetrieval.py. The ContextPipeline, MemoryBus, MCTS, and Prometheus integrations are all live. 18 new fusion-specific tests cover the MCTS and Prometheus wiring.