Tags: engineering, dark-factory, agents, ci-cd, automation, architecture

The Dark Factory Pattern: How We Ship an Entire App in Under an Hour Using AI Agent Swarms on CI/CD Infrastructure

Name: AitherOS
Author: Aitherium

March 10, 2026 · 16 min read · Aitherium

The $50,000 Problem

A client hands you a 284-line scope of work. Ten modules. Android app. Cloud backend. Offline sync. Multi-channel sales tracking. Reporting. Security. The full stack.

Traditional estimate: 6-8 weeks, two developers, $40,000-$60,000.

We built the first 80% in eleven minutes. The remaining 20% took forty more. Total wall-clock time from SOW to working code: under an hour. Total compute cost: about $3.

This isn't a thought experiment. This is a pattern we run in production. We call it the Dark Factory.

What Is a Dark Factory?

In manufacturing, a "dark factory" is a fully automated facility that runs with the lights off. No humans on the floor. Machines receive instructions, execute, and ship product.

We applied the same concept to software engineering. Instead of manufacturing machines, we use AI agents. Instead of a factory floor, we use CI/CD infrastructure you're already paying for.

The insight is simple: most of the code in a new application is predictable, parallelizable, and doesn't require creative judgment. Data models, CRUD endpoints, API clients, DTOs, test scaffolding, Docker configs, documentation --- this is all structural work. It follows directly from the architecture.

The hard parts (business logic edge cases, UX polish, security hardening, integration testing against real systems) require deep reasoning and full project context. But they account for only about 20% of the total code.

So we split the work:

Phase 1 (0-80%): Headless AI agents running in parallel on cloud CI/CD. Cheap, fast, disposable.
Phase 2 (80-100%): Local intelligence stack with full codebase context, reasoning models, and real tool access. Expensive, thorough, authoritative.

The Architecture

Three Agents Plan. Thirteen Workers Build. Five Reviewers Refine.

The pipeline has three stages: Plan, Factory, Refine.

Stage 1: Multi-Agent Planning

Planning is not a single prompt. It's a conversation between three specialized agents:

The Architect reads the scope of work and designs the complete system: data model with every entity, field, and relationship. API surface with every endpoint. Directory structure. Tech stack decisions. Component interfaces. Business logic flows. Edge cases.

This agent has access to code search across the existing codebase, web research for current framework best practices, and file system access to read existing patterns. It doesn't guess --- it investigates.

The Designer reviews the Architect's output and adds the human layer: user flows for every workflow, screen-by-screen UX guidance, accessibility considerations, offline behavior, branding direction, input validation rules, error messaging. Every decision a frontend developer would normally make through iteration, this agent makes upfront by reasoning about the user's actual workflow.

For the inventory tracking app, the Designer identified that the market sales screen was the most critical UX surface. The business owner would be standing at a booth, phone in one hand, making change with the other. The screen needed to work with one thumb. That insight shaped the entire POS interface design.

The Code Planner takes the combined architecture and design documents and decomposes them into discrete, self-contained work units. Each work unit is a complete set of instructions for a single AI agent to execute in isolation. No shared state. No inter-agent communication. Just a prompt containing everything that agent needs to produce complete, working code.

This is the hardest job. The Code Planner must embed enough context into each work unit's instructions that a standalone model can produce code that will integrate correctly with code produced by twelve other standalone models running in parallel, none of which can see each other's output.
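To make "self-contained" concrete, here is a minimal sketch of what a work unit and its prompt assembly might look like. The `WorkUnit` schema, the field names, and `build_prompt` are illustrative assumptions, not the actual format the Code Planner emits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkUnit:
    """One self-contained assignment for a single factory worker (hypothetical schema)."""
    unit_id: int
    role: str
    instructions: str                 # what this worker must produce
    expected_files: tuple[str, ...]   # paths the worker is expected to emit
    shared_contracts: str = ""        # interface details repeated in every prompt that needs them


def build_prompt(unit: WorkUnit, architecture_summary: str) -> str:
    """Assemble everything the worker needs into one standalone prompt."""
    file_list = "\n".join(f"- {path}" for path in unit.expected_files)
    return (
        f"Role: {unit.role}\n\n"
        f"Architecture:\n{architecture_summary}\n\n"
        f"Shared contracts:\n{unit.shared_contracts}\n\n"
        f"Task:\n{unit.instructions}\n\n"
        f"Produce exactly these files:\n{file_list}\n"
    )


# Example values are made up for illustration.
unit = WorkUnit(
    unit_id=2,
    role="Backend Auth",
    instructions="Implement the JWT system, password hashing, and RBAC middleware.",
    expected_files=("app/auth/jwt.py", "app/auth/passwords.py"),
    shared_contracts="User model: id, email, role",
)
prompt = build_prompt(unit, "Python backend, Android client")
```

The point of the `shared_contracts` field in this sketch is that anything two workers must agree on gets copied into both prompts, since neither worker can see the other's output.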

For a ten-module inventory application, this produced thirteen work units:

#	Role	What It Produces
1	Backend Data	13 files: database models, config, migrations, dependency manifest
2	Backend Auth	4 files: JWT system, password hashing, RBAC middleware
3	Backend CRUD	2 files: product and supply endpoints with inventory adjustment
4	Backend Logic	3 files: recipe engine, production batch system with transactional supply deduction
5	Backend Sales	3 files: POS order flow, customer tracking, inventory service
6	Backend Wiring	6 files: reports, alerts, app entrypoint, Dockerfile, compose
7	Frontend Setup	14 files: build system, DI, theme, navigation, manifest
8	Frontend Data	17 files: API client, DTOs, Room database, DAOs, offline entities
9	Frontend UI A	11 files: dashboard, product screens, supply screens, shared components
10	Frontend UI B	14 files: POS sales screen, recipe builder, batch flow, reports, offline sync
11	Testing	8 files: full pytest suite with fixtures, auth, CRUD, batch, sales, report tests
12	Security	2 files: STRIDE threat model, security test harness
13	Documentation	5 files: README, API reference, deployment guide, data model docs, changelog

That's 102 files across two platforms, designed to integrate without any of the thirteen workers ever communicating.

Stage 2: The Dark Factory

This is where CI/CD becomes compute infrastructure.

Each work unit becomes a parallel job in a matrix workflow. All thirteen jobs launch simultaneously on cloud runners. Each job:

Downloads the shared plan artifact (architecture + its specific work unit instructions)
Calls a language model through the platform's built-in model API --- no external API keys, no extra cost beyond what you're already paying for your development platform subscription
Extracts code blocks from the model's output and writes them to disk
Uploads the generated files as a build artifact
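The middle two steps above can be sketched in a few dozen lines of Python. This is a sketch under stated assumptions: the `FILE: <path>` header before each fenced block is a convention we assume the prompt asks for, and `call_model` is a placeholder callable standing in for whatever model API your platform exposes. Downloading and uploading artifacts are left to the workflow itself.

```python
import re
import tempfile
from pathlib import Path
from typing import Callable

FENCE = "`" * 3  # built this way so the sketch can itself live inside a fenced block

# Assumed convention: "FILE: <relative path>" on its own line, then a fenced code block.
FILE_BLOCK = re.compile(
    r"^FILE:[ \t]*(?P<path>\S+)[ \t]*\n" + FENCE + r"[^\n]*\n(?P<body>.*?)\n" + FENCE,
    re.DOTALL | re.MULTILINE,
)


def extract_files(model_output: str) -> dict[str, str]:
    """Map each declared relative path to the contents of its code block."""
    return {m.group("path"): m.group("body") + "\n" for m in FILE_BLOCK.finditer(model_output)}


def write_files(files: dict[str, str], out_dir: Path) -> list[Path]:
    """Write extracted files under out_dir, refusing paths that escape it."""
    root = out_dir.resolve()
    written = []
    for rel_path, content in files.items():
        target = (root / rel_path).resolve()
        if not target.is_relative_to(root):
            raise ValueError(f"unsafe path in model output: {rel_path}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        written.append(target)
    return written


def run_worker(prompt: str, call_model: Callable[[str], str], out_dir: Path) -> list[Path]:
    """One factory job: call the model, extract code blocks, write them to disk."""
    return write_files(extract_files(call_model(prompt)), out_dir)


def fake_model(prompt: str) -> str:
    """Stand-in for the real model call, used only to exercise the sketch."""
    return f"FILE: app/main.py\n{FENCE}python\nprint('hi')\n{FENCE}\n"


with tempfile.TemporaryDirectory() as tmp:
    paths = run_worker("build the entrypoint", fake_model, Path(tmp))
    demo_content = paths[0].read_text(encoding="utf-8")
```

The path check matters more than it looks: the model controls the file names, so a worker that writes wherever it is told is a worker that can overwrite the runner's own files.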

A collector job then merges all thirteen outputs into a single feature branch and opens a draft pull request.
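The merge itself is mostly file copying, with one thing worth checking: two workers claiming the same path. A minimal sketch, assuming each worker's artifact has been downloaded into its own directory (the function name and the keep-first policy are our illustrative choices; creating the branch and opening the pull request are not shown):

```python
import shutil
import tempfile
from pathlib import Path


def merge_artifacts(artifact_dirs: list[Path], dest: Path) -> list[str]:
    """Copy every worker's files into dest; return paths claimed by more than one worker."""
    collisions = []
    for artifact in artifact_dirs:
        for src in sorted(p for p in artifact.rglob("*") if p.is_file()):
            rel = src.relative_to(artifact)
            target = dest / rel
            if target.exists():
                collisions.append(rel.as_posix())
                continue  # keep the first copy; surface the conflict for review
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, target)
    return collisions


# Tiny demonstration with two fake worker outputs that overlap on one file.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "unit-1/app").mkdir(parents=True)
    (base / "unit-2/app").mkdir(parents=True)
    (base / "unit-1/app/models.py").write_text("# models\n")
    (base / "unit-2/app/auth.py").write_text("# auth\n")
    (base / "unit-2/app/models.py").write_text("# duplicate\n")
    merged = base / "merged"
    demo_collisions = merge_artifacts([base / "unit-1", base / "unit-2"], merged)
    demo_merged = sorted(p.relative_to(merged).as_posix() for p in merged.rglob("*") if p.is_file())
```

A non-empty collision list usually points back at the plan: two work units were given overlapping file ownership.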

Time for all thirteen jobs to complete: 8-12 minutes. They run in parallel. The slowest worker (usually the POS frontend, because it's the most complex UX) is the bottleneck.

Cost: approximately $0.20-$0.30 per work unit. Thirteen units cost under $4, for 102 files of production-quality scaffolding.
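The arithmetic behind that figure, using only the numbers quoted here:

```python
WORK_UNITS = 13
COST_PER_UNIT_USD = (0.20, 0.30)  # per-unit range quoted in the text

low, high = (round(WORK_UNITS * cost, 2) for cost in COST_PER_UNIT_USD)
print(f"${low:.2f} to ${high:.2f}")  # $2.60 to $3.90
```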

The key realization: CI/CD runners are general-purpose compute with network access, filesystem access, and authentication already wired up. You don't need a dedicated "AI agent platform." You need a workflow file and a Python script.

Stage 3: Local Refinement

This is where the other 20% happens --- and where the real intelligence lives.

The factory output is good. It follows the architecture. The code compiles. The data models are correct. But it hasn't been tested against the actual project, hasn't been reviewed by something with full codebase context, and hasn't been hardened by a security auditor with real tool access.

The refinement stage pulls the factory branch locally and runs it through a gauntlet:

Code Review with full codebase context. Not just "does this code look right" but "does this code integrate correctly with the existing 203 services, follow established patterns, use the right port resolution, and handle the edge cases that our specific infrastructure creates." The reviewer has access to the full code graph --- call hierarchies, dependency trees, import chains.

Security Audit with real tooling. Not just "identify theoretical vulnerabilities" but "run actual injection tests, verify CORS headers, check that audit logs are append-only, confirm rate limiting works." The security agent has tool access --- it can execute code, make HTTP requests, inspect database schemas.

Integration Testing against the actual project. The pytest suite runs against the factory output with real database connections and real API calls, so failures surface immediately.

Targeted Fix Pass for issues found. A coding agent reviews the specific findings from the review, security, and test stages, and makes surgical fixes. This is not a rewrite. It closes the gap where the factory got 80% of the way and missed the last 20%.

Final Judgment. An arbiter agent reviews everything --- the original plan, the factory output, the review findings, the security report, the test results, the fixes --- and renders a verdict: ACCEPT (ship it), REVISE (needs more fixes), or REJECT (fundamental issues, re-run the factory).
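The three verdicts map naturally onto a small type. In the pipeline the arbiter is an agent reasoning over all the evidence; the rule-based `judge` below is only a stand-in to show the shape of the decision, and the `RefinementReport` fields and thresholds are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    ACCEPT = "accept"   # ship it
    REVISE = "revise"   # needs more fixes
    REJECT = "reject"   # fundamental issues, re-run the factory


@dataclass
class RefinementReport:
    """Hypothetical summary of what the earlier refinement stages found."""
    failed_tests: int
    open_security_findings: int
    unresolved_review_issues: int
    architecture_violations: int


def judge(report: RefinementReport) -> Verdict:
    """Rule-based stand-in for the arbiter agent (the real one is a model, not a rule set)."""
    if report.architecture_violations > 0:
        return Verdict.REJECT
    if report.failed_tests or report.open_security_findings or report.unresolved_review_issues:
        return Verdict.REVISE
    return Verdict.ACCEPT


clean = judge(RefinementReport(0, 0, 0, 0))
needs_work = judge(RefinementReport(failed_tests=2, open_security_findings=0,
                                    unresolved_review_issues=1, architecture_violations=0))
```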

Why This Works (And Why It Didn't Before)

Three things changed that made this pattern viable:

1. Models got good enough at structured output. Two years ago, asking a language model to produce thirteen files that integrate correctly was a coin flip. Today, with a sufficiently detailed prompt (which is what the Code Planner agent produces), the hit rate on compilable, architecturally correct code is above 90% per work unit.

2. CI/CD platforms added model APIs. The factory workers don't need API keys, credit cards, or external accounts. They authenticate with the same token the workflow already has. This eliminated the biggest operational friction.

3. The planning stage became multi-agent. A single prompt cannot produce a 44KB plan with the level of detail needed for thirteen independent workers to produce code that integrates. But three agents collaborating (an architect who understands systems, a designer who understands users, and a code planner who understands decomposition) consistently produce plans at the quality level required.

The Numbers

For the inventory tracking application (10 modules, Android + Python backend, 284-line SOW):

Metric	Value
Planning time (3 agents)	~4 minutes
Factory time (13 parallel jobs)	~11 minutes
Refinement time (5 review stages)	~40 minutes
Total wall-clock time	~55 minutes
Files generated	102
Lines of code (estimated)	~8,000-12,000
Factory compute cost	~$3
Human intervention	0 (review only)

Compare to traditional:

Metric	Traditional	Dark Factory
Time to first working build	2-3 weeks	< 1 hour
Cost to MVP scaffold	$15,000-$25,000	$3 + compute
Developer hours for boilerplate	80-120 hours	0
Time spent on the interesting problems	20%	80%

The last row is the real point. The factory doesn't replace developers. It lets them skip the boring part and start where the work actually matters.

What the Factory Is Bad At

Honesty matters more than hype. Here's where this pattern breaks down:

Highly interconnected systems. If every component depends on every other component's internal state, you can't parallelize the work. The factory requires decomposability.

Existing codebase integration. The factory works best for greenfield projects or isolated modules. Integrating into a large existing codebase requires the kind of context that only the local refinement stage provides.

Creative UX. The factory produces functional UI. It does not produce delightful UI. The Designer agent gets you 80% of the way on UX decisions, but the last 20% --- animation timing, micro-interactions, that feeling of polish --- still requires human taste.

Novel algorithms. If the core of your application is a new algorithm that doesn't exist in training data, the factory will produce a plausible-looking implementation that's probably wrong. The planning stage will correctly identify this as a risk, but the factory workers won't solve it.

Compliance-critical systems. Healthcare, finance, aerospace. The factory is a starting point, not a certification. In regulated domains, every line still needs human review.

The Template

We've open-sourced the pattern (not the orchestration layer) as a reusable template. You need:

A planning pipeline that produces a structured JSON plan with self-contained work units
A CI/CD workflow with matrix strategy that fans out work units to parallel jobs
A lightweight worker script that calls your model API and extracts code from responses
A collector job that merges artifacts and opens a PR
A refinement pipeline that reviews, tests, and hardens the output

The planning pipeline is the hard part. The rest is plumbing.
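As an example of the plumbing, here is a sketch of the step between the first two items: validating the JSON plan and turning it into a job matrix. The plan schema (`work_units`, `id`, `role`, `instructions`) is hypothetical, and the `{"include": [...]}` output shape is modeled on GitHub Actions matrix syntax, which is an assumption since nothing above names a specific platform.

```python
import json


def validate_plan(plan: dict) -> list[dict]:
    """Check that every work unit carries the fields a standalone worker needs."""
    required = {"id", "role", "instructions"}
    units = plan.get("work_units", [])
    if not units:
        raise ValueError("plan has no work units")
    seen = set()
    for unit in units:
        missing = required - set(unit)
        if missing:
            raise ValueError(f"work unit missing fields: {sorted(missing)}")
        if unit["id"] in seen:
            raise ValueError(f"duplicate work unit id: {unit['id']}")
        seen.add(unit["id"])
    return units


def to_matrix(plan: dict) -> str:
    """Emit one matrix entry per work unit; full instructions stay in the plan artifact."""
    units = validate_plan(plan)
    return json.dumps({"include": [{"unit_id": u["id"], "role": u["role"]} for u in units]})


# Made-up two-unit plan for illustration.
plan = {
    "work_units": [
        {"id": 1, "role": "Backend Data", "instructions": "Models, config, migrations."},
        {"id": 2, "role": "Backend Auth", "instructions": "JWT, hashing, RBAC."},
    ]
}
matrix = json.loads(to_matrix(plan))
```

Keeping the matrix entries small and leaving the instructions in the shared artifact avoids pushing a large plan through job parameters.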

What's Next

We're working on three extensions:

Iterative factory runs. When the Judge says REVISE, automatically re-run only the failed work units with the review feedback injected into their prompts. Currently this is manual. It should be a loop.

Cross-project learning. Every factory run produces data: which work units succeeded, which failed, what the common failure modes are, which prompt patterns produce the best code. We're feeding this back into the planning agents so they get better at decomposition over time.

Hybrid local-cloud. Some work units benefit from local GPU access (complex reasoning, code that needs to reference the existing codebase). Others are pure boilerplate that cloud runners handle fine. Dynamic routing based on work unit complexity.
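The first of these, the REVISE loop, could look roughly like the sketch below. It is not what we run today (that step is still manual): `run_unit` and `review` are placeholder callables for the factory worker and the refinement stage, and the cap of three rounds is arbitrary.

```python
from typing import Callable


def refine_loop(
    prompts: dict[int, str],
    run_unit: Callable[[int, str], None],
    review: Callable[[], dict[int, str]],
    max_rounds: int = 3,
) -> bool:
    """Re-run only the work units that review flagged, with feedback appended to their prompts.

    review() returns {unit_id: feedback} for failing units; an empty dict means ACCEPT.
    Returns True if the output was accepted within max_rounds.
    """
    current = dict(prompts)  # do not mutate the caller's plan
    for unit_id, prompt in current.items():
        run_unit(unit_id, prompt)
    for _ in range(max_rounds):
        feedback = review()
        if not feedback:
            return True
        for unit_id, notes in feedback.items():
            current[unit_id] = f"{current[unit_id]}\n\nReviewer feedback to address:\n{notes}"
            run_unit(unit_id, current[unit_id])
    return not review()


# Fake worker and reviewer: unit 3 fails once, then passes.
runs: list[int] = []
pending = [{3: "missing pagination on /products"}, {}]


def fake_run(unit_id: int, prompt: str) -> None:
    runs.append(unit_id)


def fake_review() -> dict[int, str]:
    return pending.pop(0) if pending else {}


accepted = refine_loop({1: "data models", 3: "CRUD endpoints"}, fake_run, fake_review)
```

In this run, unit 1 executes once and unit 3 executes twice, which is the whole point: a REVISE verdict should cost one work unit, not thirteen.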

The Takeaway

The gap between "AI can write code" and "AI can ship software" has always been orchestration. A single model in a single context window can write a function. But shipping software means coordinating dozens of concerns --- data models, API contracts, frontend state, test coverage, security, documentation --- that span hundreds of files.

The Dark Factory pattern solves this by decomposing the problem at the planning stage, parallelizing the predictable work, and concentrating expensive intelligence on the parts that actually need it.

You don't need a dedicated AI platform. You need agents that can plan, infrastructure that can execute, and intelligence that can refine. The tools already exist. The pattern is the product.

Aitherium Labs builds AI-native development infrastructure. If you're interested in the Dark Factory pattern for your team, reach out.
