Lights Out, Lights On: The Complete Dark Factory Story
Five blog posts. Seven pipeline runs. Thirteen parallel AI workers. One shipped storefront. Zero regrets.
This is the capstone. The final entry in the Dark Factory series — a retrospective on the complete arc of building real software with AI agent swarms, from first principles to a live product at wildroot.aitherium.com. Not a demo. Not a proof of concept. A real application for a real business, running in production, processing real inventory.
If you've followed the series, this ties it all together. If you're new here, this is the one post that covers the whole journey.
The Arc
The Dark Factory series started with a deceptively simple question: can AI agents build production software if you give them the right context, the right tools, and the right orchestration?
The answer, it turns out, is yes — but. The "but" is where all the interesting lessons live.
Part 1: The Essay Principle
We started with context management — because without it, nothing else works.
The core insight: any competent agent can accomplish a complex task with 4,000 tokens of context, if those tokens are the right ones. We call this the Essay Principle. A surgeon doesn't read the patient's entire medical history before operating. They read a focused briefing: the relevant diagnosis, the specific procedure, the two prior surgeries that affect the approach. Maybe 2-3 pages. An essay, not an encyclopedia.
We built a five-tier memory architecture to make this practical:
- Tier 0 — the active prompt. Surgically curated for each request. Small, fast, merciless.
- Tier 1 — hot cache. 30-minute TTL. Recently used context with instant recall.
- Tier 2 — working memory. Session-scoped with semantic search.
- Tier 3 — long-term memory. Cross-session, 7-day decay curve.
- Tier 4 — knowledge graph. Permanent. Relational. The bedrock.
The critical property: movement between tiers is always a demotion, never a deletion. Nothing is ever lost. It just becomes harder to access. Content flows up and down the tier stack continuously, positioning itself for maximum retrieval speed based on predicted future relevance.
Query-conditioned relevance scoring handles the curation problem — every piece of candidate context gets scored against the current query, not against the conversation topic in general. The prompt reshapes itself around each query like water filling a container.
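Query-conditioned curation can be sketched in a few lines. This is illustrative only — it uses bag-of-words cosine similarity as a stand-in for the real embedding-based scorer, and none of these names are the AitherOS API — but it shows the essential move: score every candidate against the query, then greedily pack the winners into a fixed token budget.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def curate(query: str, candidates: list[dict], budget: int) -> list[dict]:
    """Score every candidate against the *current query* (not the
    conversation topic) and pack the highest scorers into the budget."""
    q = Counter(query.lower().split())
    scored = sorted(
        candidates,
        key=lambda c: cosine(q, Counter(c["text"].lower().split())),
        reverse=True)
    prompt, used = [], 0
    for c in scored:
        if used + c["tokens"] <= budget:  # greedy fill under the token cap
            prompt.append(c)
            used += c["tokens"]
    return prompt
```

Swap in a real embedding model and the shape stays the same: the prompt is rebuilt per query, which is exactly why it "reshapes itself like water."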
This architecture is what makes everything else in the series possible. Without surgical context management, the agents that plan, build, review, and refine would drown in irrelevant information. The Essay Principle isn't a nice-to-have. It's the foundation.
Part 2: It Wrote Its Own PR
With context management solved, we tackled autonomous maintenance.
Three GitHub Actions workflows. Two AitherOS agents. One self-hosted runner. Zero human prompting.
A scheduled bug-hunter workflow scanned the codebase for anti-patterns — hardcoded URLs, shell injection risks, documentation drift. It filed four issues automatically, with correct labels, layer classification, and enough context for someone to act.
An agent-dispatcher workflow picked up one of those issues and routed it to the Demiurge agent running on a self-hosted runner with full access to our AI stack. Demiurge wrote a fix, created a branch, committed, opened a PR.
Then the Atlas PR Guardian fired. Atlas read the diff, checked against architectural rules, and posted a structured review with five checks. The verdict: REQUEST CHANGES.
Atlas caught that the fix was incomplete — it patched one function but missed the companion function called in the same code path. It identified that the agent used a hardcoded URL replacement when an existing utility function already handled environment-aware URL resolution. It found that the test used a relative path that would break in our test runner.
The review even included a process improvement suggestion for the agent that wrote the code.
This was the moment the autonomous CI/CD loop closed. Discovery → fix → review, with quality gates at every step and zero human intervention. The reviewer was honest enough to reject the fix when it was wrong. That honesty is the whole point.
Part 3: The Dark Factory Pattern
With maintenance automated, we aimed higher: can agents build an entire application from a scope of work?
The Dark Factory pattern splits the problem in two:
- Phase 1 (0-80%): Cheap, fast, parallel. Thirteen AI workers on cloud CI/CD runners, each executing a self-contained work unit from a decomposed plan. $3 total. 11 minutes wall-clock.
- Phase 2 (80-100%): Expensive, thorough, authoritative. Local intelligence stack with full codebase context, reasoning models, and real tool access.
Three agents plan: an Architect who designs the system, a Designer who adds the human layer, and a Code Planner who decomposes the work into independent units. The Code Planner has the hardest job — it must embed enough context into each work unit that thirteen standalone models can produce code that integrates correctly, with none of them seeing each other's output.
Five reviewers refine: code review with full codebase context, security audit with real tooling, integration testing against the actual project, targeted fix pass, and a final judge who renders the verdict.
The key realization: CI/CD runners are general-purpose compute with network access, filesystem access, and authentication already wired up. You don't need a dedicated AI agent platform. You need a workflow file and a Python script.
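A worker on a CI runner can be genuinely minimal. Here is a hedged sketch of what one Phase 1 worker does — the plan schema, function names, and injected `generate` callable are hypothetical, not the actual factory script — but it captures the contract: read one self-contained work unit, call a model, write files, exit.

```python
import json
import pathlib

def run_work_unit(plan_path: str, unit_id: str, out_dir: str, generate) -> list[str]:
    """Execute one self-contained work unit from the decomposed plan.
    `generate` is the model call (prompt -> {filename: source}); it is
    injected so the same script runs against any provider, or a stub."""
    plan = json.loads(pathlib.Path(plan_path).read_text())
    unit = next(u for u in plan["units"] if u["id"] == unit_id)
    files = generate(unit["prompt"])  # the embedded context does the heavy lifting
    out = pathlib.Path(out_dir)
    for name, source in files.items():
        dest = out / name
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(source)
    return sorted(files)
```

Thirteen of these run as parallel matrix jobs; each sees only its own unit, which is why the Code Planner's embedded context has to carry the integration burden.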
Part 4: The First Run
Then we actually ran it. On a real scope of work. 284 lines. Ten modules. A full-stack inventory management system for a botanical products business.
It failed six times before it worked.
Run 1: GitHub's secret scanner silently redacted the plan JSON because work unit prompts mentioned keywords like SECRET_KEY and PASSWORD. The plan output vanished.
Run 2: Thirteen parallel jobs hit the model API rate limit simultaneously. The Retry-After header said "0" and our retry logic interpreted that as "retry immediately." Chaos.
Run 3: Phase 2 refinement was routed to the wrong endpoint: Hydra, a solo code reviewer, instead of the factory refinement pipeline. It completed successfully — just not the right thing.
Run 4: The collector merged all output into the repository root instead of under the expedition directory. Kotlin build files next to AitherOS configs.
Run 5: Full git clone on a massive repository. Ten-minute timeout on a job that needed three seconds of actual work.
Run 6: The refinement agents could only see 6% of the generated code (truncated diff), couldn't run tests (pytest not installed), had no architectural context (plan not mounted), and rejected the output. Technically correct — they didn't have enough information to approve.
Run 7: Success. The pipeline produced 67 files, 2,989 lines of code. Backend architecture was production-quality — modern async FastAPI with SQLAlchemy 2.0, clean separation, JWT auth with RBAC. Domain model correctly captured the full business logic. Integration bugs existed (about 15 across the codebase), but the architecture was right and the patterns were right.
The lesson: we spent more time fixing the orchestration pipeline than the agents spent writing the code. The factory workers are the easy part. The infrastructure — CI/CD integration, artifact management, branch strategy, rate limiting, volume mounting — that's where the actual engineering lives.
But we fixed it once. The next expedition doesn't hit any of those problems.
Part 5: No Frontend Required
And then came the post that changed the framing of the entire series.
When the Scope of Work for Wildroot Alchemy was written, it defined exactly ten modules. Products, Supplies, Recipes & BOM, Production Batches, Market Sales Tracking, Customer Management, Online Sales Import, Reporting & Alerts, a Data Model, and an Implementation Timeline.
Not a single pixel was specified.
No storefront. No brand identity. No navigation. No hero images. No ambient forest audio. No dark mode. No blog. The spec read like a database schema with API endpoints — because that's exactly what it was.
And when we audited the build against the SOW, the backend was 85% complete. Every model existed. Every CRUD endpoint worked. The data layer was solid. The frontend? 40% complete. Four entire pages missing.
The agent swarm built exactly what the spec defined. And nothing more.
This is the Dark Factory pattern in its purest form: the spec defines the machine, not the soul. A dark factory runs with the lights off. The machines don't need ambiance. They need correct inputs, reliable processes, and quality outputs.
But the moment you point a browser at it, you're no longer in the factory. You're in the store. And a store selling handcrafted botanical remedies needs to feel like something.
Everything you see on wildroot.aitherium.com right now — the forest canopy aesthetic, the Shopify headless integration, the dark mode with flash prevention, the ambient audio, the blog, the settings page, the navigation, the brand story — none of it was in the SOW. All of it was necessary. All of it required a human in the loop.
What We Actually Shipped
Let's be concrete about what exists today at wildroot.aitherium.com:
The Dark Factory (lights off):
- FastAPI backend with full inventory management
- Products, Supplies, Recipes with bill-of-materials linking
- Production Batches that transactionally deduct supplies and increase inventory
- Market/Event Sales with POS workflow and payment tracking
- Customer management with purchase history
- Low-stock alerts and reporting
- Shopify product sync via GraphQL API
- Blog CMS with admin CRUD and published/draft status
- Settings management with encrypted Shopify credentials
- JWT authentication with role-based access control
- PostgreSQL with Alembic migrations
- Docker Compose deployment on a VPS
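The transactional batch behavior in that list is worth making concrete. A minimal sketch, using an in-memory sqlite transaction to stand in for the real PostgreSQL/SQLAlchemy backend (table and column names are illustrative): every BOM supply is deducted and the finished product incremented in one transaction, so a stock shortfall rolls back the entire batch.

```python
import sqlite3

def commit_batch(db: sqlite3.Connection, product_id: int,
                 qty: int, bom: dict[int, int]) -> None:
    """Record a production batch atomically: deduct each supply per the
    bill of materials, then increase finished-product inventory. Any
    shortfall raises, and the `with db:` block rolls everything back."""
    with db:  # sqlite3 connection context manager = one transaction
        for supply_id, per_unit in bom.items():
            needed = per_unit * qty
            cur = db.execute(
                "UPDATE supplies SET on_hand = on_hand - ? "
                "WHERE id = ? AND on_hand >= ?",
                (needed, supply_id, needed))
            if cur.rowcount != 1:  # guard in the WHERE clause failed
                raise ValueError(f"insufficient supply {supply_id}")
        db.execute(
            "UPDATE products SET on_hand = on_hand + ? WHERE id = ?",
            (qty, product_id))
```

Putting the stock check in the `WHERE` clause (rather than a read-then-write) also keeps the deduction safe under concurrent batches.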
The Storefront (lights on):
- Immersive brand experience with parallax hero sections
- Shopify headless commerce pulling real product data
- Full CSS variable system with dark mode and flash prevention
- Ambient forest audio toggle
- Blog/Journal with rich markdown rendering
- Settings page with live integration status indicators
- Scroll-aware navigation header
- Brand story page with process narrative
- Contact page with brand-appropriate copywriting
- Responsive design across all breakpoints
The Infrastructure:
- AitherOS mesh-ready load balancer with six routing strategies
- Circuit breakers, session affinity, and health probing
- Prometheus metrics export
- Infrastructure that handles one node today and is ready for a hundred
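The circuit-breaker half of that stack reduces to a small state machine. This is a minimal illustrative sketch — thresholds, names, and the half-open policy here are assumptions, not the AitherOS load balancer's implementation: trip after N consecutive failures, reject calls while open, and admit one probe after a cooldown.

```python
import time

class CircuitBreaker:
    """Trip after `threshold` consecutive failures; reject calls while
    open; allow a single probe through once `cooldown` has elapsed."""
    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures = 0
        self.opened_at: float | None = None  # None = circuit closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.cooldown:
            return True  # half-open: admit one probe request
        return False     # open: fail fast, protect the backend

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None  # probe succeeded: close
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()    # trip the breaker
```

One breaker per backend node, consulted by whichever routing strategy is active, is the standard composition.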
The Real Lessons
After five posts, six failed pipeline runs, and one shipped product, here's what we actually learned:
1. Context is everything. The Essay Principle proved itself at every layer. Agents with 4,000 tokens of the right context outperform agents with 128,000 tokens of everything. Surgical curation isn't optimization — it's a prerequisite for agents that actually work.
2. Orchestration is harder than generation. The AI wrote good code in 11 minutes. Making CI/CD infrastructure, model APIs, git workflows, Docker containers, and artifact management work together reliably in a zero-touch pipeline? That took six hours of debugging across six failed runs. The agents are the easy part. The plumbing is the engineering.
3. Agents are brutally literal. They build what you specify with extraordinary precision. They don't add a blog because "it would be nice." They don't choose a font because it evokes botanical luxury. If it's not in the spec, it doesn't exist. This is a feature when you want reliable execution. It's a gap when you want a product people love.
4. Taste is the last human monopoly. The gap between "correct software" and "software people want to use" is taste. And taste — for now — still requires a human who writes specs that include the soul alongside the schema. The agents can build the dark factory. A human has to turn on the lights in the store.
5. Fix it once, fix it forever. Every pipeline failure we debugged — secret sanitization, rate limit jitter, output packaging, shallow clones, plan loading — is permanently solved. The infrastructure cost is amortized across every future expedition. The second project doesn't hit any of these problems. Neither does the tenth.
6. The autonomous CI/CD loop works — with honest reviewers. An agent that discovers bugs, writes fixes, and reviews its own code is useful only if the reviewer is honest enough to reject bad fixes. Atlas caught an incomplete fix and requested changes. That honesty is worth more than the automation that generated the fix.
7. Build infrastructure before you need it. We built a load balancer with six routing strategies for a single-node deployment. We built a tiered memory system before we had enough conversations to fill it. We built Dark Factory orchestration before we had a client project. Every time, the infrastructure was ready when the demand arrived. The alternative — adding infrastructure under load — is how systems break.
The Numbers, Honestly
| | Traditional | Dark Factory |
|---|---|---|
| Estimate | 6-8 weeks, $60,000 | ~1 week total |
| Phase 1 (13 parallel workers) | N/A | 11 minutes, $3 |
| Infrastructure debugging | N/A | ~6 hours (one-time) |
| Phase 2 refinement + polish | Included in 6-8 weeks | ~2 days |
| Storefront & brand (human) | Included in 6-8 weeks | ~3 days |
| Total cost | ~$60K (labor) | Under $10 (compute) |
The factory didn't produce a finished product in 11 minutes. It produced a foundation — a correct, well-architected backend with working business logic — at the speed of CI/CD. A human then spent a few days adding the soul: the brand experience, the storefront, the ambient atmosphere, the editorial voice.
That's still roughly an 85% reduction in time-to-ship. And it's an honest accounting, not a cherry-picked demo.
What's Next
The Dark Factory series is complete. The pattern is proven. The product is shipped. But the work continues:
Shopify order import automation. The SOW specified it. The Dark Factory didn't build it (it was in a later phase). It's next.
Event-based sales workflows. The POS system exists in the backend. The mobile-optimized market sales UI is coming — designed for one-thumb operation at a farmers market booth, exactly as the Designer agent originally specified.
Backup and recovery. Not in the original spec. Not caught by any code review. Identified during the storefront build as obviously necessary. This is what happens when humans stay in the loop — they notice the gaps that agents can't see because those gaps were never specified.
Dark Factory v2. Tool-augmented workers that can syntax-check their own output. Integration verification gates before refinement. The full Forge coding swarm (11 specialized agents in a 4-phase pipeline) for deep refinement instead of single-pass review. Human-gated workflow with issues as operational ledgers and PRs as code-review surfaces.
More expeditions. The orchestration pipeline is built. The infrastructure is amortized. The next scope of work goes from document to deployed code faster, cheaper, and with fewer surprises. The factory is ready for its next project.
The Capstone
The Dark Factory pattern works. Not perfectly. Not without failure. Not without human judgment at the critical moments. But it works.
You write a spec. Three agents plan the architecture. Thirteen workers build in parallel on infrastructure you already pay for. Five reviewers refine with real tools and honest criticism. A human adds the taste, the brand, the soul — the parts that make software worth using instead of merely correct.
The lights stay off in the factory. The machines don't need ambiance. But out here in the store — where the forest canopy sways, the ambient audio whispers, and the botanical remedies glow in dark-mode warmth — the lights are very much on.
We built the machine. Then we built the experience.
That's the whole story.
This is the final installment of the Dark Factory series. Previous entries: Surgical Context Management, Autonomous CI/CD, The Dark Factory Pattern, First Run: What Broke, What Worked, and No Frontend Required.
David Parkhurst builds AI systems that build software at Aitherium Labs.