We Shipped an App in 11 Minutes. Here's What Broke, What Worked, and What's Next for the Dark Factory.
Published by Aitherium — March 11, 2026
Previously, on "Agents Writing Software"
This post is part four of a series. If you're new here:
Part 1: Surgical Context Management — We built a tiered memory system so agents work from curated briefings instead of encyclopedia-dumps. The Essay Principle: any competent agent can accomplish a complex task with 4,000 tokens of context, if those tokens are the right ones.
Part 2: It Wrote Its Own PR and Then Reviewed It — We closed the autonomous CI/CD loop. A GitHub Actions workflow discovered bugs, dispatched an agent to write a fix, created a PR, then dispatched a different agent to review it — and the reviewer rejected the fix for being incomplete. Zero human prompting.
Part 3: The Dark Factory Pattern — We described the theory: multi-agent planning, parallel execution on CI/CD, local refinement. The architecture.
This post is Part 4: what happened when we actually ran it.
The Mission: Build Wildroot Alchemy
A real client scope of work. 284 lines. Ten modules. Android app with Kotlin and Jetpack Compose. Cloud backend with Python FastAPI and PostgreSQL. Offline sync. Multi-channel sales tracking (market booths, Shopify, Etsy). Production batches that consume supplies and produce products. Recipes as bills of material. Low stock alerts. Reporting. JWT authentication with role-based access. The full stack.
The expedition plan decomposed this into 13 work units:
| WU | What It Builds | Files |
|---|---|---|
| 01 | Database models, config, migrations | 13 |
| 02 | JWT auth, password hashing, RBAC | 4 |
| 03 | Product & supply CRUD endpoints | 2 |
| 04 | Recipe engine, production batches | 3 |
| 05 | POS sales flow, customer tracking | 3 |
| 06 | Reports, alerts, app wiring, Docker | 6 |
| 07 | Android project setup, DI, theme, nav | 14 |
| 08 | API client, DTOs, Room database, DAOs | 17 |
| 09 | Dashboard, product screens, supply UI | 11 |
| 10 | POS sales screen, recipe builder, batch flow | 14 |
| 11 | Full pytest suite | 8 |
| 12 | STRIDE threat model, security tests | 2 |
| 13 | README, API docs, deployment guide | 5 |
All 13 work units launch simultaneously as parallel jobs on GitHub Actions. Each job calls GPT-4o through GitHub's built-in Models API — no external API keys, no extra cost beyond what you already pay for your development platform subscription.
Target: a single feature branch with a complete, integrated codebase. Under 15 minutes. Under $5.
Here's what actually happened.
Run 1: The Invisible Output
The plan job ran perfectly. Atlas designed the architecture, decomposed the work units, produced a 44KB JSON plan. Then the output vanished.
No error. No failure. The plan JSON just... wasn't there when the factory workers tried to read it.
Root cause: GitHub's secret scanner. The plan JSON contained work unit prompts that mentioned SECRET_KEY, PASSWORD, JWT_ALGORITHM — because that's what the authentication module is. GitHub saw those keywords in the workflow output and silently redacted the entire artifact.
Fix: We added _sanitize_plan_for_output() that strips security-adjacent keywords from the output copy of the plan (the one that gets uploaded as a workflow artifact). The actual plan used by workers is unaffected. It's a hack. It works.
Lesson: CI/CD platforms are designed for human developers who never put secrets
in build artifacts. AI agents don't know that. Your orchestration layer needs
to sanitize AI output for the assumptions of the infrastructure it runs on.
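A minimal sketch of what a sanitizer like _sanitize_plan_for_output() might do, assuming the plan is plain JSON-style data (dicts, lists, strings). The keyword list and the [REDACTED] placeholder are illustrative assumptions; the post names only SECRET_KEY, PASSWORD, and JWT_ALGORITHM as examples, and the real function may work differently.

```python
import json
import re

# Assumed keyword list: only the three examples named in the post.
SENSITIVE_WORDS = ("SECRET_KEY", "PASSWORD", "JWT_ALGORITHM")
_PATTERN = re.compile("|".join(re.escape(w) for w in SENSITIVE_WORDS), re.IGNORECASE)


def sanitize_plan_for_output(plan):
    """Return a scrubbed copy of the plan; the input object is not modified.

    Only string values are masked. Dictionary keys are left alone.
    """
    if isinstance(plan, str):
        return _PATTERN.sub("[REDACTED]", plan)
    if isinstance(plan, list):
        return [sanitize_plan_for_output(item) for item in plan]
    if isinstance(plan, dict):
        return {key: sanitize_plan_for_output(value) for key, value in plan.items()}
    return plan  # numbers, booleans, None pass through unchanged


if __name__ == "__main__":
    plan = {"work_units": [{"id": "02", "prompt": "Load SECRET_KEY and hash the password"}]}
    print(json.dumps(sanitize_plan_for_output(plan), indent=2))
```

Because the function builds new containers instead of editing in place, the workers can keep reading the original plan while only the uploaded copy is scrubbed.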
Run 2: The Rate Limit Wall
12 of 13 work units completed successfully. WU-06 (reports + alerts + app wiring) hit a 429 Too Many Requests from the GitHub Models API. The collect job then failed because it tried to push to a branch that had moved.
Root cause: 13 parallel jobs all calling the same model API endpoint within a ~2 minute window. The API responded with Retry-After: 0 — which our retry logic helpfully interpreted as "retry immediately." All retries fired at once. Same result.
Fix: Minimum exponential backoff with jitter, regardless of what the Retry-After header says. Base delay of 5 seconds, exponential growth up to 60 seconds, ±20% random jitter. 8 retries instead of 5. The collect job now uses git checkout -B and --force-with-lease instead of assuming a clean branch.
Lesson: When a rate-limited API tells you "Retry-After: 0", it's lying.
Or rather, its retry header is calibrated for individual human users, not
13 parallel agents hammering the same endpoint. Always enforce a minimum
backoff floor.
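The backoff floor can be sketched as follows. The constants come from the fix described above; the doubling factor, the function name backoff_delay, and the choice to honor a Retry-After value only when it exceeds the floor are assumptions for illustration.

```python
import random
from typing import Callable, Optional

BASE_DELAY = 5.0   # seconds (from the post)
MAX_DELAY = 60.0   # seconds (from the post)
JITTER = 0.20      # plus or minus 20% (from the post)
MAX_RETRIES = 8    # from the post


def backoff_delay(
    attempt: int,
    retry_after: Optional[float] = None,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Assumes the delay doubles on each attempt; the post only says "exponential".
    """
    exponential = min(BASE_DELAY * (2 ** attempt), MAX_DELAY)
    # A Retry-After of 0 (or a missing header) never lowers the delay below the floor.
    delay = max(exponential, retry_after or 0.0)
    # rng() is in [0, 1); scale it to a multiplier in [1 - JITTER, 1 + JITTER).
    multiplier = 1.0 + JITTER * (2.0 * rng() - 1.0)
    return delay * multiplier


if __name__ == "__main__":
    for attempt in range(MAX_RETRIES):
        print(attempt, round(backoff_delay(attempt, retry_after=0), 2))
```

The jitter matters as much as the floor: without it, 13 jobs that fail together also retry together.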
Run 3: The Wrong Agent
All 13 work units passed. The collect job merged everything into a feature branch. The notify job triggered Phase 2 refinement on our local self-hosted runner. Phase 2 returned: HAS_CHANGES: false.
That's... wrong. There were 63 files and 2,728 lines of new code.
Root cause: The Phase 2 workflow was routing refinement to the wrong endpoint. It was calling /forge/dispatch/sync with the Hydra agent (a solo code reviewer) instead of /factory/refine (the full multi-stage intelligence pipeline: review → security → tests → fixes → judge). Hydra looked at the diff, saw only new files, and concluded there was nothing to review, because it is a reviewer that looks for changes to existing code, not a factory evaluator.
Fix: Route Phase 2 to /factory/refine. Pre-compute the diff on the runner (where git works normally) and pass it in the request body (since Genesis runs in Docker with a read-only .git mount).
Lesson: "It worked" and "it ran the right thing" are different assertions.
The workflow completed successfully. It just called the wrong service.
Always verify that the endpoint your pipeline calls is the endpoint you think
it calls.
Run 4: The Root Dump
All 13 work units passed again. The collect job merged them. Into the repository root.
backend/ next to AitherOS/. android/ next to AitherZero/. docs/ colliding with our existing docs/. A 100,000-file repository suddenly had Kotlin build files at the top level.
Root cause: The collect job wrote files exactly where the workers wrote them — at the paths specified in the plan. The plan said backend/app/main.py and that's what it got. But the plan was for the Wildroot project, which lives at expeditions/wildroot-alchemy/. Nobody told the collector to package the output under the expedition directory.
Fix: Added output_root to the plan JSON schema. The collector now prefixes all file paths with the output root. backend/app/main.py becomes expeditions/wildroot-alchemy/backend/app/main.py.
Lesson: Parallel workers produce output relative to their own context.
The orchestration layer must define where that output goes in the global
namespace. This is the "deployment target" problem, and it's the same
problem whether you're deploying Docker containers or AI-generated source files.
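A minimal sketch of the path prefixing, assuming the collector receives POSIX-style relative paths from the workers. The function name package_path and the guard against absolute or parent-escaping paths are illustrative additions, not something the post describes.

```python
from pathlib import PurePosixPath


def package_path(output_root: str, worker_path: str) -> str:
    """Place a worker-relative path under the expedition's output root."""
    relative = PurePosixPath(worker_path)
    # Assumed safety check: refuse paths that could land outside output_root.
    if relative.is_absolute() or ".." in relative.parts:
        raise ValueError(f"unsafe worker path: {worker_path}")
    return str(PurePosixPath(output_root) / relative)


if __name__ == "__main__":
    print(package_path("expeditions/wildroot-alchemy", "backend/app/main.py"))
```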
Run 5: The Infinite Clone
All 13 work units passed. The collector packaged everything under the correct directory. Then the collect job timed out after 10 minutes — on git fetch.
Root cause: fetch-depth: 0. Full clone. Our repository is massive (roughly 100,000 files and a long history). The collect job needs exactly one commit's worth of context. It was downloading the entire history of the project to merge 63 files.
Fix: fetch-depth: 1. Shallow clone. The collect job doesn't need history. It needs a workspace.
Lesson: Every CI/CD job should use the minimum fetch depth required for
its operation. This is doubly true for jobs that are part of a pipeline
with tight time budgets. A 10-minute timeout is generous for a human
developer; it's an eternity wasted for a collector job that needs 3 seconds
of actual work.
Run 6: The Hollow Review
All 13 work units passed. Collect succeeded. Files properly packaged. Phase 2 refinement triggered. The refinement returned: REJECT. Review length: 270 characters.
270 characters. For a 63-file, 2,728-line codebase.
Root cause: Four problems compounding:
- pytest wasn't installed in the Genesis container. The test step failed immediately.
- _run_tests() was hardcoded to look in dev/tests/, which is AitherOS's test directory, not the project's. The Wildroot tests live at expeditions/wildroot-alchemy/backend/tests/.
- The diff was truncated to 8,000 characters. 63 files of code serialized to 121KB of diff. The agents saw ~6% of the output.
- The plan was None. The _load_plan_by_id() function searched for factory-plan.json inside the Genesis container, but the expeditions/ directory isn't mounted.
So the agents reviewed a tiny fragment of the code, couldn't run tests, had no architectural context, and rejected it. A correct decision, technically — they didn't have enough information to approve.
Fix: All four issues:
- Smart test discovery: walk the diff to find test_*.py files, derive the project root, auto-install requirements.txt + pytest
- Increase diff limit to 32,000 characters (4x) with intelligent truncation
- Update _load_plan_by_id() so that it searches expeditions/*/factory-plan.json
- Auto-install pytest in the container if it's missing
Lesson: Refinement agents need the same support infrastructure as human
developers. A human reviewer wouldn't accept a code review where they
could only see 6% of the diff, couldn't run tests, and had no design doc.
Neither should your agents.
Run 7: Success (Sort Of)
We skipped GitHub Actions for this one and called /factory/refine directly from the local machine.
Result:
{
  "status": "needs_revision",
  "verdict": "ACCEPT",
  "confidence": 0.5,
  "review_length": 211,
  "security_findings": 0,
  "fixes_applied": 0,
  "test_results": "directory not found",
  "duration_ms": 289828
}
Verdict: ACCEPT. Confidence: 0.5. Tests couldn't find the directory because Genesis doesn't have the expedition files mounted. The review was thin (211 chars) because the agent compressed its output.
Five minutes of wall-clock time. The code review agent scored 1.0 (perfect) on AitherEval's trace analysis. The EffortScaler correctly classified it as effort level 7 (complex) and routed to the orchestrator model.
The pipeline works. The infrastructure around it is what failed six times.
The Actual Output: What Gemini Thinks
We ran an independent assessment using Gemini (completely outside our pipeline — no prompt engineering, just "evaluate this code against the SOW"). Here's the honest scorecard:
✅ What the Factory Nailed (7/10 — Good Foundation)
Backend architecture is production-quality. Modern async FastAPI with SQLAlchemy 2.0, clean separation into models/, routers/, services/, and auth/. JWT authentication with bcrypt hashing and RBAC (admin/staff). Every endpoint properly protected.
Domain model is accurate. The data model correctly captures the full business domain: Products with SKUs, Supplies with unit tracking, Recipes as bills of material with ingredient quantities, Production Batches that transactionally deduct supplies and increase product inventory, Sales Orders with line items and multiple payment methods, Customers with purchase history, and an InventoryAdjustment audit log that tracks every change with timestamp, user ID, and reason. This isn't placeholder code — it's a working inventory system.
Android foundation is solid. Kotlin + Jetpack Compose + Hilt DI + Room database. The project structure follows current Android best practices. The Room database mirrors the backend entities with proper DAOs. The navigation graph is correctly set up.
🟡 What Gemini Found Missing
Android UI is incomplete. Only Dashboard and Sales screens were generated. No Product CRUD screens, no Supply management, no Recipe builder, no Batch production flow. The SOW specified 10 modules; the Android client has UI for maybe 3.
No Shopify/Etsy integration. Phase 2 of the SOW required importing online sales from Shopify API and Etsy API. Zero code for this. No API clients, no OAuth flows, no import logic.
Offline sync is scaffolded but not implemented. Room database exists (that's the offline storage), but the sync logic — queueing changes while offline, conflict resolution, WorkManager-based background sync — isn't there.
🔴 What Gemini Missed
Here's where the independent assessment falls short:
Integration bug: app.routers.schemas doesn't exist. The products router contains the line from app.routers.schemas import ..., but no schemas.py exists in routers/. There is a schemas.py in models/. This is a classic parallel-worker integration failure: WU-03 (CRUD endpoints) expected schemas in one location; WU-01 (data models) put them in another. The tests won't even load, let alone pass. Gemini didn't run the code, so it missed this entirely.
Missing __init__.py files. Several Python packages lack init files, which means imports fail at runtime.
No alembic.ini or migration files. The plan specified Alembic for database migrations, and Alembic is listed in the requirements, but no migration configuration exists. The database won't initialize without manual table creation.
Docker Compose references unbuilt images. The docker-compose.yml references wildroot-backend but no Dockerfile exists. You can't docker compose up.
These aren't quality issues. These are "the app doesn't start" issues. They're exactly what Phase 2 refinement should catch — if the agents can actually see the code and run the tests.
The encouraging part is the count: this wasn't a 200-bug disaster. It was roughly 15 integration bugs across a freshly generated, multi-module codebase. That's honestly strong first-pass output for a parallel factory run. More importantly, they were the kind of bugs a frontier model can now polish quickly, cheaply, and mostly mechanically once the files are in one place and the runtime actually works.
The Real Numbers
| Metric | Planned | Actual |
|---|---|---|
| Planning time | ~4 min | ~4 min (via pre-built plan) |
| Factory time (13 parallel jobs) | ~11 min | 8-12 min per run |
| Factory compute cost | ~$3 | ~$3 per run |
| Infrastructure debugging time | 0 | ~6 hours across 6 runs |
| Files generated | 102 | 67 (source), 63 (in diff) |
| Lines of code | ~8,000 | 2,989 |
| Tests that pass | all | 0 (import error) |
| Runs to get Phase 1 right | 1 | 6 |
| Human intervention for pipeline | 0 | extensive |
Let's be honest: we spent more time fixing the orchestration pipeline than the agents spent writing the code. The factory workers are the easy part. The infrastructure — CI/CD integration, artifact management, branch strategy, rate limiting, environment configuration, volume mounting — that's where the actual engineering lives.
But here's the thing: we fixed it once. The next expedition should not hit any of these problems. Secret sanitization, retry with jitter, output packaging, shallow clones, and plan loading are now built into the pipeline, so the infrastructure cost amortizes across every future run.
Why This Still Matters (Despite the Failures)
A human developer quoting this project would estimate 6-8 weeks and $60,000.
In 11 minutes, for $3, the factory produced:
- A complete, well-structured FastAPI backend with 8 routers, 3 service layers, and JWT auth
- An Android project with Hilt DI, Room database, Jetpack Compose navigation, and Material 3 theming
- 8 test files with fixtures and async test infrastructure
- A STRIDE security threat model
- API documentation, data model docs, and a deployment guide
The code has bugs. The Android UI is incomplete. The integration layer has gaps. But the architecture is right. The domain model is right. The patterns are right.
A senior developer could take this output and make it production-ready in 2-3 days. Not 6-8 weeks. That's a 90% reduction in time-to-MVP, even accounting for the incompleteness.
The factory didn't produce a finished product. It produced a foundation — and it produced it at the speed of CI/CD, not the speed of human typing.
The Road to Dark Factory v2
Here's what we're building to close the remaining gaps. The core philosophy doesn't change: Phase 1 is cheap, parallel, and fast. Phase 2 is where the intelligence lives.
The GitHub Actions factory stays on GPT-4o via the GitHub Models API. Could we use smarter models there? Absolutely — the Models API gives you access to 40+ models including Claude, Gemini, and Llama. We're intentionally not. The factory workers are doing structural code generation. That's a speed-and-cost problem, not an intelligence problem. $3 and 11 minutes for 13 parallel jobs is the right trade.
Every v2 improvement below targets Phase 2: the local refinement step, running through Demiurge (our orchestrator agent) and the Forge coding swarm — 11 specialized LLM agents executing a 4-phase pipeline: ARCHITECT → SWARM → REVIEW → JUDGE. That's where reasoning, tool access, and verification live.
1. Delivery Packaging (Implemented)
The first Dark Factory v2 upgrade is already live: the collect job now creates a delivery archive so the output is usable without a git checkout.
What's shipping now: the collect step produces a .tar.gz, injects a MANIFEST.json with plan metadata and a file inventory, uploads it as a 30-day GitHub Actions artifact, and posts the download location in the draft PR body. No git clone required.
It also creates a tracking issue for the run. The issue becomes the canonical execution record — plan ID, branch, archive link, and gate status — while the PR stays focused on code review. That's a cleaner split: issues track the operation, PRs carry the code.
What's next: make that archive self-starting with generated setup.sh / run.sh scripts and eventually publish a runnable Docker image to GHCR.
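For illustration, here is a sketch of the archive step using only the Python standard library. The manifest fields (plan_id, files with path and bytes) are assumed; the post says only that MANIFEST.json carries plan metadata and a file inventory. The function name build_archive is hypothetical.

```python
import io
import json
import tarfile
from pathlib import Path


def build_archive(source_dir: str, archive_path: str, plan_id: str) -> dict:
    """Write source_dir into a .tar.gz and inject a generated MANIFEST.json."""
    root = Path(source_dir)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    manifest = {
        "plan_id": plan_id,  # assumed field name
        "files": [
            {"path": p.relative_to(root).as_posix(), "bytes": p.stat().st_size}
            for p in files
        ],
    }
    payload = json.dumps(manifest, indent=2).encode("utf-8")
    with tarfile.open(archive_path, "w:gz") as tar:
        for p in files:
            tar.add(str(p), arcname=p.relative_to(root).as_posix())
        # The manifest exists only in memory, so add it from a byte buffer.
        info = tarfile.TarInfo("MANIFEST.json")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return manifest
```

Returning the manifest lets the caller reuse the same inventory in the PR body or the tracking issue without reopening the archive.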
2. Demiurge + Forge Swarm Refinement
Run 7 proved the refinement pipeline connects. It also proved it's shallow — a single-pass review that returned 211 characters and 0.5 confidence. That's not refinement. That's a rubber stamp.
v2 refinement runs through Demiurge and the full Forge coding swarm. Demiurge is our orchestrator agent. The Forge is the SwarmCodingEngine — 11 specialized LLM agents organized in a 4-phase pipeline:
- ARCHITECT — Analyzes the factory output against the original plan. Identifies structural gaps, integration failures, missing modules.
- SWARM — Multiple specialist agents attack the gaps in parallel: one fixes import paths, another writes missing tests, another fills in incomplete UI screens, another wires up missing API integrations.
- REVIEW — Code review agents evaluate the swarm's fixes against the project's architecture and coding standards.
- JUDGE — Final quality gate. Synthesizes all review feedback into a verdict with confidence score.
The Forge operates in forge mode (full tool access — file reads, web search, code execution) rather than llm mode (pure text generation). That means Phase 2 agents can actually run the code they're reviewing.
The intelligence methodologies applied during refinement:
- SASE (Sparse Attention with Synthetic Experience) for code review — our cognitive architecture for deep analysis that maintains attention coherence across large codebases
- CUGA (Contextual Understanding through Grounded Abstraction) for security audit — grounding abstract security principles in the concrete code being reviewed
- TDD-driven fix pass — the swarm doesn't just "fix issues from the review." It writes a failing test first, then writes the fix that makes it pass, then verifies. Test-Driven Development as an agent methodology, not just a human practice.
- DeepThink isn't a model — it's an applied methodology. Before generating fixes, the agents explicitly plan their approach, identify edge cases, consider alternatives, and document their reasoning. The thinking happens in structured steps, not in the model's hidden chain-of-thought.
The key insight: reasoning is a methodology, not a model selection. You don't just pick a "smarter model." You give the agents a structured process — decomposition, hypothesis testing, explicit uncertainty acknowledgment — and apply it through the swarm pipeline regardless of which model backs the individual agents.
3. Tool-Augmented Refinement
The factory workers (Phase 1) are pure text generation — they receive a prompt and produce code. That's fine. They're fast and cheap and the output is structurally sound.
The problems the factory creates — like the app.routers.schemas import bug where WU-03 imported from a path that WU-01 never created — are integration problems. Workers can't see each other's output during parallel generation. That's inherent to the parallel model, and it's Phase 2's job to fix.
The Forge swarm runs in forge mode with full tool access. During refinement, the swarm agents can:
- Read the actual generated files — not a truncated diff, but the real source tree
- Run syntax checks — Python AST parsing, import resolution, linting
- Execute tests — spin up pytest, read stack traces, iterate on fixes
- Query package registries — verify that dependencies exist at the specified versions
- Search documentation — look up current API specs for frameworks referenced in the code
- Cross-reference the plan — check each work unit's output against the architectural decisions in the factory plan
This is what forge mode means in the SwarmCodingEngine: the agents aren't generating code in a vacuum. They're operating like developers with an IDE. The import path bug gets caught in the first 30 seconds of refinement because the review agent tries to resolve from app.routers.schemas import ... and discovers the file doesn't exist.
4. Integration Verification Gate
The first thing Demiurge does when Phase 2 starts — before the Forge swarm touches anything — is run a structural verification pass:
- Syntax check: Does every file parse without errors? (Python AST, Kotlin compiler frontend, XML validation)
- Import resolution: Do all imports resolve? Check cross-file references.
- Schema consistency: Do API endpoint signatures match the models they reference?
- Test runner: Do the tests at least load? (Not pass — just load without import errors.)
This verification produces a concrete error manifest. The Forge swarm's ARCHITECT phase uses this manifest to create targeted fix assignments — instead of asking agents to "review the code and find problems," it says "here are 4 broken imports, 2 missing init files, and a schema mismatch. Fix them." Specific, verifiable, actionable.
If the structural damage is severe enough (>30% of files have errors), Demiurge can flag specific work units for re-generation with the error context injected into their prompts. The factory re-runs only the broken units, not all 13.
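The Python half of that gate can be sketched with the standard ast module. This is an assumption-laden illustration, not Demiurge's implementation: files are passed in as a path-to-source mapping relative to the backend root, only absolute imports under one local package are checked, and a package counts as present only if it has an __init__.py. The 30% threshold is the one stated above.

```python
import ast


def module_exists(module: str, files: dict) -> bool:
    """True if the dotted module maps to a .py file or a package __init__.py."""
    base = module.replace(".", "/")
    return f"{base}.py" in files or f"{base}/__init__.py" in files


def verify(files: dict, local_package: str) -> dict:
    """Build an error manifest of syntax errors and unresolved local imports."""
    errors, broken = [], set()
    python_files = sorted(p for p in files if p.endswith(".py"))
    for path in python_files:
        try:
            tree = ast.parse(files[path], filename=path)
        except SyntaxError as exc:
            errors.append({"file": path, "kind": "syntax", "detail": exc.msg})
            broken.add(path)
            continue
        for node in ast.walk(tree):
            if isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                modules = [node.module]
            elif isinstance(node, ast.Import):
                modules = [alias.name for alias in node.names]
            else:
                continue
            for module in modules:
                is_local = module.split(".")[0] == local_package
                if is_local and not module_exists(module, files):
                    errors.append({"file": path, "kind": "import", "detail": module})
                    broken.add(path)
    ratio = len(broken) / len(python_files) if python_files else 0.0
    return {"errors": errors, "broken_ratio": ratio, "regenerate": ratio > 0.30}
```

Run against a tree where app/routers/products.py imports app.routers.schemas but schemas.py lives under app/models/, this reports one import error for that file, which is the kind of targeted entry the ARCHITECT phase can turn into a fix assignment.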
5. Human-Gated Workflow
The goal isn't full autonomy. The goal is human-gated autonomy: the factory runs completely unattended, but key checkpoints require human approval before proceeding.
SOW received → [HUMAN: approve plan] → Factory runs → Integration gate (auto)
→ Refinement runs → [HUMAN: review verdict] → Fix pass (auto) → Final build
→ [HUMAN: approve deployment] → Ship
Three human gates:
- Plan approval — "Yes, this architecture matches what the client wants"
- Review approval — "Yes, this code is acceptable quality for delivery"
- Deploy approval — "Yes, ship this to the client"
Everything else — the 13 parallel workers, the code review, the security audit, the test suite, the fix pass, the repackaging — runs without human involvement. The factory does the work. The human does the judgment. In v2, the issue is the operational ledger for those gates, and the PR is the artifact review surface.
6. Runtime Verification
Generated code that compiles isn't generated code that works. The Forge swarm's final phase includes runtime verification:
- Backend: Spin up a test PostgreSQL container, run Alembic migrations, start the FastAPI server, execute the test suite against a real database. Not mocked. Real.
- Android: Run Gradle build. If it compiles, run the unit tests. Instrumented tests require an emulator (future work).
- Integration: Hit the backend's health endpoint from a test client. Verify auth flow end-to-end.
If the runtime verification fails, the SWARM phase gets the actual error output — stack traces, assertion failures, HTTP status codes — not a reviewer's opinion about what might be wrong. The swarm iterates: fix → verify → fix → verify, until the tests pass or the JUDGE calls it and reports what remains broken.
The Broader Pattern
What we built with the Dark Factory isn't specific to AitherOS. The pattern is general:
- Decompose a large task into independent work units
- Execute those work units in parallel on cheap compute (the factory — fast, dumb, and intentionally so)
- Verify the integrated output against real constraints (structural checks, not vibes)
- Refine with concentrated intelligence via an orchestrator and specialist swarm (Demiurge + Forge — slow, smart, and tool-augmented)
- Gate key decisions for human judgment
This is how manufacturing works. This is how military logistics works. This is how any complex system that needs to be both fast and reliable works: parallelize the predictable, concentrate intelligence on the unpredictable, and put humans at the decision points.
The six failures taught us that the hard part isn't the AI. The AI wrote good code in 11 minutes. The hard part is the orchestration: making CI/CD infrastructure, model APIs, git workflows, Docker containers, and artifact management all work together reliably in a pipeline that involves zero human touch between the gates.
That's an engineering problem, not an AI problem. And engineering problems, we know how to solve.
Try It
The Dark Factory pipeline is part of the AitherOS expedition system. The Wildroot Alchemy output is at expeditions/wildroot-alchemy/ in the repo.
Want to run your own expedition? Write a scope of work, put it in expeditions/{name}/SOW.txt, and create a factory-plan.json with your work units. The rest is infrastructure.
We'll be publishing updates as the Demiurge + Forge swarm refinement pipeline matures. Follow the series.
The Aitherium team builds AI systems at Aitherium Labs.