engineering · knowledge-graph · graphrag · architecture · deep-dive

Wikipedia Knowledge Graph: How We Gave Our Agents Real-World Awareness

March 11, 2026 · 10 min read · Aitherium

Our agents are good at code. They understand service topology, memory graphs, API relationships, and log traces. But ask one about the difference between a transformer architecture and a recurrent neural network, or whether retrieval-augmented generation is a solved problem, and you get silence. They were blind to the real world.

That changed today.

The Problem: Agents Without World Knowledge

AitherOS runs 15 faculty graphs -- specialized in-process caches that index code, memory, services, configs, documentation, media files, and more. These graphs sync to a unified KnowledgeGraph service via GraphSyncBus, and a CrossDomainLinker discovers relationships between them. An agent reasoning about a function can see its tests, its config dependencies, the services that call it, and the memory entries that reference it.

But all of that is internal knowledge. When a developer writes a comment mentioning "attention mechanism" or "RLHF", the system has no idea what those terms mean. When a code review references "the CAP theorem", there's nothing in the graph to anchor that reference. The agents are experts in our codebase but ignorant of the concepts our codebase builds on.

We already had half the solution. NewsWire pulls from five news APIs every 30 minutes. NewsGraph ingests Bluesky and Moltbook feeds every 5-10 minutes. CompetitiveIntel synthesizes market intelligence. Our agents know what's happening right now. But they didn't know the foundational knowledge that gives current events meaning.

The Solution: WikipediaGraph

WikipediaGraph is the 15th faculty graph in AitherOS. It auto-ingests Wikipedia articles via the REST API on a 6-hour schedule, extracts entities and categories, builds a local knowledge graph, and syncs everything to our unified KnowledgeGraph through the same GraphSyncBus that every other faculty graph uses.

Topic-Driven Ingestion

Rather than trying to ingest all of Wikipedia (that's what our bulk pipeline is for), WikipediaGraph watches a configurable list of topics defined in config/wikipedia.yaml:

watched_topics:
  - artificial intelligence
  - machine learning
  - autonomous agents
  - neural network
  - large language model
  - reinforcement learning
  - computer vision
  - natural language processing
  - robotics
  - knowledge graph
  - transformer architecture
  - deep learning
  - generative AI
  - multi-agent system
  - retrieval augmented generation

domain_topics:
  - cybersecurity
  - cloud computing
  - edge computing
  - distributed systems
  - microservices architecture

For each topic, WikipediaGraph searches Wikipedia, fetches the top articles, and processes them. The topic list is user-configurable -- add your domain's terms and the system starts building knowledge around them.
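In code, the ingestion loop is little more than search-then-fetch per topic. A minimal sketch, with the REST calls abstracted behind injected `search` and `fetch_article` callables (hypothetical stand-ins, not the actual AitherOS client):

```python
def ingest_topics(topics, search, fetch_article, top_n=3):
    """For each watched topic, search Wikipedia and process the top articles.

    `search(topic)` returns a ranked list of article titles;
    `fetch_article(title)` returns the processed article. Both are injected
    so the flow can run (and be tested) without network access.
    """
    processed = {}
    for topic in topics:
        for title in search(topic)[:top_n]:
            if title in processed:  # the same article can match several topics
                continue
            processed[title] = fetch_article(title)
    return processed
```

In production the two callables would wrap the Wikipedia REST API behind the rate limiter described below; deduplicating by title keeps overlapping topics from fetching the same article twice.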

Three-Layer Extraction

Every article gets processed into three node types:

Article nodes carry the summary (up to 500 characters), page ID, content hash for change detection, and metadata like fetch time and Wikipedia's last-modified timestamp.

Entity nodes are extracted via regex-based NER that identifies capitalized multi-word phrases. "Machine Learning", "Recurrent Neural Network", "Yann LeCun" -- these become first-class nodes in the graph, cross-referenced across every article that mentions them. An entity that appears in 15 different articles is clearly important; one that appears in only a single article is contextual.

Category nodes come from Wikipedia's own category system. An article about "Transformer (deep learning architecture)" belongs to categories like "Neural network architectures", "Natural language processing", and "Deep learning". These hierarchical categories give the graph taxonomic structure for free.
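The NER pass described above is simple enough to sketch in a few lines. This is an approximation of the idea, not the actual AitherOS patterns; a production pass would also strip leading stopwords like "The":

```python
import re
from collections import Counter

# Matches runs of two or more capitalized words, allowing hyphenated
# tokens, e.g. "Recurrent Neural Network" or "Long Short-Term Memory".
ENTITY_RE = re.compile(r"\b(?:[A-Z][\w-]*\s+)+[A-Z][\w-]*\b")

def extract_entities(text: str) -> Counter:
    """Return candidate entities with their mention counts."""
    return Counter(m.group(0) for m in ENTITY_RE.finditer(text))
```

Counting mentions per article is what lets the graph later distinguish an entity referenced in 15 articles from one referenced in a single paragraph.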

How the Cache Works

We maintain an OrderedDict-based LRU cache of up to 1000 articles. The ordered dict tracks access patterns -- every get_article() call moves the article to the end. When the cache fills up, the least-recently-accessed articles are evicted, and their entity and category references are cleaned up. Entities that were only referenced by the evicted article get removed entirely; shared entities keep their remaining references.

This means the cache naturally gravitates toward the articles agents actually use, while still maintaining a broad knowledge base from the watched topics.

Rate Limiting

Wikipedia asks that API consumers be polite. We enforce a strict 1-request-per-second rate limit via an asyncio.Semaphore, with a custom User-Agent header identifying AitherOS. The rate limiter serializes all concurrent requests through a single gate, so even if multiple topics are being ingested simultaneously, we never hammer the API.
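A sketch of what such a gate might look like: an `asyncio.Semaphore` of size one serializes concurrent callers, and a minimum-interval sleep enforces the one-request-per-second pace (the real limiter's details may differ):

```python
import asyncio
import time

class WikiRateLimiter:
    """Serialize requests through one gate, at most one per interval."""

    def __init__(self, min_interval: float = 1.0):
        self._gate = asyncio.Semaphore(1)  # single in-flight request
        self._min_interval = min_interval
        self._last_request = 0.0

    async def __aenter__(self):
        await self._gate.acquire()
        wait = self._min_interval - (time.monotonic() - self._last_request)
        if wait > 0:
            await asyncio.sleep(wait)
        return self

    async def __aexit__(self, *exc):
        self._last_request = time.monotonic()
        self._gate.release()
```

Each API call would then run inside `async with limiter:`; the custom User-Agent header is a property of the HTTP client, set separately from the limiter.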

The Full Pipeline

WikipediaGraph doesn't exist in isolation. It completes a pipeline we've been building for months:

EXTERNAL SOURCES                    INGESTION                   UNIFIED STORE
-----------------                   ---------                   -------------
Wikipedia REST API -----> WikipediaGraph (6h)  ---+
                                                  |
NewsData, NewsAPI,                                |
WorldNews, TheNewsAPI, -> NewsWire (30min)  ------+ GraphSyncBus
GNews                                             | (5s flush)
                                                  +---> KnowledgeGraph:8196
Bluesky Firehose -------> NewsGraph (5min)  ------+           |
Moltbook Feed ----------> NewsGraph (10min) ------+           |
                                                         CrossDomainLinker
                                                         LiveNewsCache
                                                         CompetitiveIntel
                                                         GraphTrainingHarvester

Eight external data sources. Three ingestion paths. One unified knowledge graph. Agents get context from all of it automatically through the same ContextPipeline that assembles their system prompts.

The CrossDomainLinker is where the magic happens. When WikipediaGraph syncs an article about "transformer architecture" and CodeGraph has indexed a file called transformer_attention.py, the linker discovers the relationship. Now an agent working on that file can see not just the code's call graph and test coverage, but the conceptual knowledge behind what it implements.
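One way to approximate what the linker does, reduced to its simplest signal -- shared-term overlap between code file paths and article titles (the actual CrossDomainLinker is considerably richer than this sketch):

```python
import re

def _terms(name: str) -> set[str]:
    # "transformer_attention.py" / "Transformer architecture" -> token sets
    return {t for t in re.split(r"[^a-z]+", name.lower()) if len(t) > 2}

def link_candidates(code_files, wiki_titles, min_overlap=1):
    """Propose cross-domain links where a file shares terms with an article."""
    links = []
    for path in code_files:
        file_terms = _terms(path)
        for title in wiki_titles:
            shared = file_terms & _terms(title)
            if len(shared) >= min_overlap:
                links.append((path, title, sorted(shared)))
    return links
```

Even this naive overlap is enough to connect transformer_attention.py to an article titled "Transformer (deep learning architecture)"; raising `min_overlap` trades recall for precision.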

AgentKernel Integration

WikipediaGraph plugs into the same tick-based warming system as every other faculty graph. The AgentKernel runs a 30-second tick loop, and each graph gets a staggered start to avoid thundering herd problems:

Graph                                  Stagger        Refresh interval
-----                                  -------        ----------------
CodeGraph                              Immediate      30 minutes
Faculty graphs (Service, Doc, etc.)    90 seconds     45 minutes
MediaGraph                             150 seconds    60 minutes
WikipediaGraph                         210 seconds    6 hours
Model warming                          Immediate      20 minutes

On first boot, WikipediaGraph waits 210 seconds (after MediaGraph has had time to start), then performs a full topic ingestion. Every 6 hours after that, it runs an incremental refresh that only fetches new or stale articles (articles older than 24 hours get re-checked, and only re-indexed if their content hash has changed).
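The staleness and change-detection checks reduce to two small predicates. A sketch, where field names like `fetched_at` and `content_hash` mirror the article metadata described earlier but the exact schema is an assumption:

```python
import hashlib

STALE_AFTER = 24 * 3600  # articles older than 24h get re-checked

def is_stale(cached: dict, now: float) -> bool:
    """True when the cached article is due for a re-check."""
    return now - cached["fetched_at"] >= STALE_AFTER

def content_changed(cached: dict, fetched_text: str) -> bool:
    """True when the re-fetched text hashes differently from what we indexed."""
    new_hash = hashlib.sha256(fetched_text.encode("utf-8")).hexdigest()
    return new_hash != cached["content_hash"]
```

Only articles that pass both checks get re-indexed, so a 6-hour refresh over a mostly stable topic list costs very little.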

The heavy task semaphore prevents WikipediaGraph from competing with other background tasks. If CodeGraph is mid-reindex, WikipediaGraph waits for the next tick.

Bulk Pipeline: The Foundation Layer

For foundational knowledge loading, we have a separate bulk pipeline at scripts/wikigraph_pipeline.py. This downloads full Wikidata and Wikipedia dumps (tens of millions of entities), streams them through entity extraction, builds GraphRAG community summaries using union-find clustering, and feeds everything into KnowledgeGraph via batch ingestion.

The bulk pipeline handles the one-time foundational load. WikipediaGraph handles the ongoing, API-based updates. Together, they give agents both breadth (millions of entities from Wikidata) and freshness (regular article updates from the REST API).
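The union-find clustering step can be sketched as follows: co-mentioned entity pairs are unioned, and each resulting root becomes one GraphRAG community (a simplified stand-in for the actual pipeline, which summarizes each community afterwards):

```python
class UnionFind:
    """Union-find with path halving for grouping co-mentioned entities."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def communities(edges):
    """Group entities into communities from co-mention edges."""
    uf = UnionFind()
    for a, b in edges:
        uf.union(a, b)
    groups = {}
    for node in uf.parent:
        groups.setdefault(uf.find(node), set()).add(node)
    return list(groups.values())
```

Near-linear union-find is what makes this tractable at tens of millions of entities; the community summaries are then generated once per group rather than once per node.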

What This Actually Looks Like

When an agent encounters "attention mechanism" in a code comment, here's what happens:

  1. WikipediaGraph has already ingested the Wikipedia article on attention mechanisms
  2. The article is synced to KnowledgeGraph as a wikipedia.article node
  3. Entities like "Transformer", "Self-Attention", "Query-Key-Value" are synced as wikipedia.entity nodes
  4. Categories like "Neural network architectures" and "Natural language processing" are synced as wikipedia.category nodes
  5. CrossDomainLinker discovers that the code file references concepts in the Wikipedia article
  6. The agent's context pipeline includes this knowledge when assembling the response

The agent doesn't just know the code. It knows what the code is about.

Numbers

  • 15 faculty graphs syncing through GraphSyncBus
  • 20 watched topics covering AI, ML, security, infrastructure
  • 1000-article LRU cache with entity and category cross-referencing
  • 3 node types: article, entity, category
  • 1-second rate limit on Wikipedia API calls
  • 6-hour refresh cycle with content-hash-based change detection
  • 145 tests passing across 20 test classes

What's Next

The immediate win is cross-domain linking between Wikipedia knowledge and our codebase. But the infrastructure supports much more. The same faculty graph pattern could ingest arXiv papers, Stack Overflow answers, or domain-specific knowledge bases. The GraphSyncBus doesn't care where the data comes from -- it just needs nodes with IDs, types, and properties.

The real-world intelligence pipeline is now complete: current events from news APIs, social awareness from Bluesky and Moltbook, and foundational knowledge from Wikipedia. Our agents are no longer blind to the world they're building for.
