It Wrote Its Own PR and Then Reviewed It: How We Closed the Autonomous CI/CD Loop
Published by Aitherium
The Screenshot That Made Me Stop Working
Monday morning. I'm debugging an event loop stall in Genesis --- the kind of deep concurrency problem that requires full attention. I glance at GitHub notifications and see four new issues I didn't file, a pull request I didn't open, and a code review I didn't write.
The issues were real bugs. The PR was a legitimate fix. And the review --- the review was better than what most junior developers would write. It identified that the fix was incomplete, pointed to the correct existing utility function the agent should have used instead, and requested three specific changes before merge.
No human prompted any of it. I didn't write a ticket. I didn't assign anyone. I didn't even know these bugs existed.
I just defined some YAML workflows and set up a self-hosted runner. The system took it from there.
What Actually Happened
Here's the timeline, reconstructed from GitHub Actions logs:
12:43 PM --- A scheduled GitHub Actions workflow called bug-hunter runs against the develop branch. It scans the codebase looking for anti-patterns: hardcoded URLs that should use service discovery, shell=True with string concatenation, documentation that's drifted from config. The workflow uses AitherOS MCP tools to query the codebase via CodeGraph.
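One of the patterns the bug hunter looks for, shell=True, can be detected with Python's standard ast module. The sketch below is an illustration of that idea only: the real workflow queries CodeGraph through MCP tools, and the function name find_shell_true_calls and the sample input are hypothetical.

```python
import ast


def find_shell_true_calls(source: str) -> list[int]:
    """Return the line numbers of calls that pass shell=True as a keyword."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for kw in node.keywords:
            # Only a literal True is flagged; shell=some_variable is not resolved.
            if (
                kw.arg == "shell"
                and isinstance(kw.value, ast.Constant)
                and kw.value.value is True
            ):
                hits.append(node.lineno)
    return sorted(hits)


# Hypothetical input: the source is only parsed, never executed.
SAMPLE = '''
import subprocess
subprocess.run("systemctl restart " + name, shell=True)
subprocess.run(["ls", "-l"])
'''

print(find_shell_true_calls(SAMPLE))  # [3]
```

A scanner like this only finds candidates. Deciding whether a given call is actually injectable (string concatenation versus a fixed command) still needs a look at the first argument.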
12:44 PM --- Four issues are filed automatically:
- [quality] Documentation Drift: AitherNode port documented as 8080, services.yaml says 8090 (#817)
- [bug-hunter] Security: shell=True with restart_command string injection in AitherWatch (7 locations) (#816)
- [bug-hunter] Security: shell=True with list in AitherAutonomic cleanup (#815)
- [quality] Documentation Drift: AitherNode port documented as 8080 but configured as 8090 (#814)
Each issue has the right labels (auto-discovered, bug, priority:high, layer:1), correct layer classification, and enough context for a developer --- human or otherwise --- to act on it.
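The documentation-drift findings come down to comparing two sources of truth. A minimal sketch of that comparison is below, assuming the documented and configured ports have already been extracted into dictionaries; how the bug hunter actually reads the docs and services.yaml is not shown in this post, and find_port_drift is a hypothetical name.

```python
def find_port_drift(
    documented: dict[str, int], configured: dict[str, int]
) -> list[str]:
    """Report services whose documented port differs from the configured one."""
    findings = []
    for service, doc_port in sorted(documented.items()):
        cfg_port = configured.get(service)
        # Services missing from the config are skipped, not reported as drift.
        if cfg_port is not None and cfg_port != doc_port:
            findings.append(
                f"{service} port documented as {doc_port}, "
                f"services.yaml says {cfg_port}"
            )
    return findings


# Values taken from the issue titles above.
print(find_port_drift({"AitherNode": 8080}, {"AitherNode": 8090}))
```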
12:44 PM --- A second workflow triggers. This one watches for issues labeled agent:local and dispatches them to the Demiurge agent --- our code generation specialist --- running on a self-hosted GitHub Actions runner on my local machine. The runner connects to the full AitherOS stack: vLLM inference, CodeGraph indexing, all 200+ MCP tools.
The Demiurge agent receives a task: replace a hardcoded localhost:11434 Ollama URL in AitherReasoning.py with host.docker.internal:11434 so the service works inside Docker containers. It also needs to write a regression test.
12:44 PM --- The agent uses replace_in_file to patch _check_orchestrator_loaded() at line 2516, replacing http://localhost:11434 with http://host.docker.internal:11434. Then it uses write_file to create a 3-line test:
def test_no_hardcoded_localhost_ollama():
    with open('services/cognition/AitherReasoning.py') as f:
        assert 'localhost:11434' not in f.read()
The commit is authored by AitherOS Agent (demiurge). A PR is created from branch auto/demiurge/20260310-124406 targeting develop. Labels applied: needs-review, auto-fix, agent:demiurge. The PR body includes the full task description, mode (forge), effort level (5), and the GitHub Actions run ID for traceability.
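The branch name and PR body follow a predictable shape, which is what makes the run traceable. A rough sketch of how such metadata could be assembled is below. The helper names and the exact body layout are assumptions for illustration; only the branch pattern, the mode, and the effort level come from the PR described above.

```python
from datetime import datetime


def agent_branch_name(agent: str, started_at: datetime) -> str:
    """Build a branch name of the form auto/<agent>/<YYYYMMDD-HHMMSS>."""
    return f"auto/{agent}/{started_at:%Y%m%d-%H%M%S}"


def pr_body(task: str, mode: str, effort: int, run_id: str) -> str:
    """Render provenance fields as plain text (layout is hypothetical)."""
    return "\n".join(
        [
            f"Task: {task}",
            f"Mode: {mode}",
            f"Effort: {effort}",
            f"Actions run ID: {run_id}",
        ]
    )


print(agent_branch_name("demiurge", datetime(2026, 3, 10, 12, 44, 6)))
# auto/demiurge/20260310-124406
```

Encoding the agent and start time in the branch name means a stray branch can be traced back to a run even if the PR is closed or its body is edited.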
12:45 PM --- A third workflow fires: the Atlas PR Guardian. Atlas is our maintenance review agent --- it reads the diff, checks it against architectural rules, and posts a structured review comment.
This is where it gets interesting.
The Review That Caught the Bug in the Fix
Atlas posted a review with five checks:
| Check | Status | Notes |
|---|---|---|
| Architecture | PASS | Layer 3 (Cognition) internal change only, no boundary violations |
| Security | PASS | No secrets, no injection vectors introduced |
| Code Quality | FAIL | Swaps one hardcoded URL for another; ollama_url() already exists in AitherPorts |
| Test Coverage | PARTIAL | Test added but fragile relative path; misses the unfixed companion function |
| Blast Radius | 1 service | AitherReasoning (Layer 3 Cognition) only |
Then it enumerated four specific issues:
Issue 1 --- Incomplete fix. The function _trigger_orchestrator_preload() at line 2535 --- called back-to-back with the patched function in evaluate_gate_with_llm() --- still hardcodes localhost:11434. Fixing one without the other means the preload will still fail in Docker.
Issue 2 --- Wrong approach. AitherPorts.py already exports ollama_url(), which auto-detects the correct host across environments (local dev, Docker via host.docker.internal, container-name-based networking). Hardcoding host.docker.internal breaks bare-metal dev environments. It's the same anti-pattern as localhost, just pointed at a different wrong address.
Atlas even showed the correct fix:
from lib.core.AitherPorts import ollama_url
# In _check_orchestrator_loaded():
resp = await client.get(f"{ollama_url()}/api/ps")
# In _trigger_orchestrator_preload():
await client.post(f"{ollama_url()}/api/chat", ...)
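The review relies on ollama_url() picking the right host per environment, but its implementation isn't shown here. As a rough idea of what an environment-aware resolver can look like, here is a sketch under stated assumptions: the function name resolve_ollama_url, the OLLAMA_URL override variable, and the /.dockerenv check are all illustrative choices, not the actual AitherPorts code.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

OLLAMA_PORT = 11434


def resolve_ollama_url(
    env: Optional[Mapping[str, str]] = None,
    in_docker: Optional[bool] = None,
) -> str:
    """Pick an Ollama base URL for the current environment (hypothetical)."""
    env = os.environ if env is None else env

    # 1. An explicit override wins, e.g. for container-name-based networking.
    override = env.get("OLLAMA_URL")
    if override:
        return override.rstrip("/")

    # 2. Otherwise guess: /.dockerenv is a common (not guaranteed) Docker marker.
    if in_docker is None:
        in_docker = Path("/.dockerenv").exists()
    host = "host.docker.internal" if in_docker else "localhost"
    return f"http://{host}:{OLLAMA_PORT}"


print(resolve_ollama_url({}, in_docker=True))   # http://host.docker.internal:11434
print(resolve_ollama_url({}, in_docker=False))  # http://localhost:11434
```

Whatever the real detection logic is, the point of the review stands: the decision lives in one function, so callers never embed a host name.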
Issue 3 --- Fragile test path. The test uses open('services/cognition/AitherReasoning.py') --- a relative path that only works if pytest runs from the AitherOS/ directory. Tests run from the repo root via python -m pytest dev/tests/, so this will raise FileNotFoundError. The fix: use Path(__file__).parent.parent.parent / "services/cognition/AitherReasoning.py".
Issue 4 --- Test doesn't catch the remaining bug. Because line 2535 still has localhost:11434, the test the agent wrote would actually fail if it could find the file. The PR is internally inconsistent.
Verdict: REQUEST CHANGES.
Three items before merge:
- Apply the same URL fix to _trigger_orchestrator_preload() (line 2535)
- Use from lib.core.AitherPorts import ollama_url in both functions instead of any hardcoded host
- Fix the test's relative file path to use Path(__file__).parent
This is a legitimate, useful code review. Atlas correctly identified that the Demiurge agent's task scope was too narrow --- it fixed one function but missed the companion function called in the same code path. It knew about the existing utility function in a completely different module. It understood that the test was structurally broken for our test runner configuration.
The review even added a constructive note:
This PR was generated by the Demiurge agent. The agent correctly identified the localhost issue but the task scope was narrow enough that it missed the companion function. Worth adjusting the Demiurge task prompt to include "check all functions in the same file that reference the same external service."
That's not just a review. That's a process improvement suggestion for the agent that wrote the code.
The Architecture Behind It
Three GitHub Actions workflows, two AitherOS agents, one self-hosted runner. No orchestration service. No ticket system. No Slack channel. Just YAML and cron.
Workflow 1: Bug Hunter (Scheduled)
Runs on a schedule against the develop branch. Uses GitHub Copilot's coding agent capabilities combined with AitherOS MCP tools to scan for anti-patterns. When it finds something, it opens a GitHub issue with structured labels.
The key insight: the bug hunter doesn't try to fix anything. It just identifies and files. Separation of concerns.
Workflow 2: Agent Dispatcher (Issue-Triggered)
Watches for issues with the agent:local label. When one appears, it dispatches the appropriate AitherOS agent via the self-hosted runner. The runner has access to the full local stack --- vLLM for inference, CodeGraph for codebase search, all MCP tools for file manipulation.
The agent creates a branch, makes changes, writes tests, commits (as AitherOS Agent (demiurge)), and opens a PR. The PR body includes full provenance: task description, mode, effort level, run ID.
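The label gate at the front of the dispatcher can be pictured as a small function over the issue event payload. The sketch below assumes the standard GitHub issues event shape (issue.number, issue.title, issue.labels[].name). The build_dispatch helper, the returned dictionary, and the fixed choice of demiurge are hypothetical, since the actual agent routing isn't described in this post.

```python
from typing import Optional


def issue_labels(event: dict) -> set[str]:
    """Collect label names from a GitHub issues event payload."""
    return {label["name"] for label in event.get("issue", {}).get("labels", [])}


def build_dispatch(event: dict, trigger_label: str = "agent:local") -> Optional[dict]:
    """Return a dispatch request, or None when the trigger label is absent."""
    if trigger_label not in issue_labels(event):
        return None
    issue = event["issue"]
    return {
        "agent": "demiurge",  # assumption: real routing may pick other agents
        "issue_number": issue["number"],
        "task": issue["title"],
    }


# Hypothetical payload, trimmed to the fields used above.
event = {
    "issue": {
        "number": 816,
        "title": "Security: shell=True with restart_command string injection",
        "labels": [{"name": "bug"}, {"name": "agent:local"}],
    }
}
print(build_dispatch(event))
```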
Workflow 3: Atlas PR Guardian (PR-Triggered)
Fires on every new PR. Reads the diff, classifies the change by architectural layer, checks against coding standards, and posts a structured review. For agent-generated PRs, it adds context about the generating agent and suggestions for prompt improvement.
Atlas can approve simple changes automatically. For anything that fails a check, it requests changes with specific, actionable items.
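The approve-or-request-changes rule can be expressed as a small aggregation over the per-check results. This is a sketch of that rule, not Atlas's implementation: the Status and Check types are hypothetical, and treating PARTIAL as a non-blocking comment is an assumption the post doesn't confirm.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PASS = "PASS"
    PARTIAL = "PARTIAL"
    FAIL = "FAIL"


@dataclass(frozen=True)
class Check:
    name: str
    status: Status


def verdict(checks: list[Check]) -> str:
    """Any FAIL blocks the PR; all PASS approves; anything else is a comment."""
    if any(check.status is Status.FAIL for check in checks):
        return "REQUEST CHANGES"
    if checks and all(check.status is Status.PASS for check in checks):
        return "APPROVE"
    return "COMMENT"  # assumption: PARTIAL alone does not block


# The four status-bearing checks from the review table above.
review = [
    Check("Architecture", Status.PASS),
    Check("Security", Status.PASS),
    Check("Code Quality", Status.FAIL),
    Check("Test Coverage", Status.PARTIAL),
]
print(verdict(review))  # REQUEST CHANGES
```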
The Self-Hosted Runner
This is the piece that makes local agent dispatch possible. GitHub's hosted runners can't reach my vLLM instance or MCP tools. The self-hosted runner runs on the same machine as AitherOS, so agents get full tool access --- file system, code search, LLM inference, the works.
The runner authenticates via the SAML SSO we set up with AitherIdentity acting as our IdP. Same identity system, same security boundary.
What This Means
The traditional software maintenance loop looks like this:
Human notices bug → Human files ticket → Human assigns developer → Developer investigates → Developer writes fix → Developer opens PR → Reviewer reads PR → Reviewer approves or requests changes → Developer addresses feedback → Merge
That's ten steps, three humans, and typically 2-5 business days.
The autonomous loop:
Workflow discovers bug → Workflow files issue → Workflow dispatches agent → Agent writes fix → Agent opens PR → Workflow dispatches reviewer → Reviewer approves or requests changes
Seven steps, zero humans, under two minutes.
And critically: the review agent caught a real problem. This isn't rubber-stamp automation. Atlas identified that the fix was architecturally wrong (hardcoding a different host instead of using the existing utility function), structurally incomplete (missed the companion function), and had a broken test. Those are exactly the things a good human reviewer catches.
The fix still needs human judgment to merge. We're not auto-merging agent PRs --- that's a trust boundary we haven't crossed yet, and probably shouldn't cross until the review agent's track record is longer. But the discovery, implementation, and quality gate are all autonomous.
What Broke (And What We Learned)
The Demiurge agent's fix was wrong. Not catastrophically wrong --- it correctly identified the problem (hardcoded localhost breaks in Docker) and applied a change that would work in container environments. But it replaced one hardcoded URL with another hardcoded URL, when the correct solution was to use the existing ollama_url() function that handles all environments.
This tells us something important about agent task prompts. The task said: "replace http://localhost:11434 with http://host.docker.internal:11434". The agent did exactly what it was told. It didn't search for related functions in the same file. It didn't check whether a utility function already existed for this purpose.
The fix isn't to make the agent smarter. The fix is to make the task prompt broader: "Eliminate all hardcoded Ollama URLs in this file. Use the existing ollama_url() from AitherPorts for environment-aware URL resolution. Write regression tests that verify no hardcoded Ollama hosts remain."
This is the same lesson human engineering managers learn: the quality of the output depends on the quality of the task definition. The agent is a capable executor. The prompt is the specification.
The Stack
For anyone building something similar, here's what's running:
- GitHub Actions --- workflows for discovery, dispatch, and review
- Self-hosted runner --- on the same machine as the AI stack, for local tool access
- AitherOS MCP tools --- 200+ tools exposed via the MCP gateway for code search, file manipulation, git operations
- Demiurge agent --- code generation specialist, dispatched via AgentForge with ReAct loop and tool calling
- Atlas agent --- maintenance review specialist, reads diffs and checks against architectural rules
- vLLM --- local LLM inference (Nemotron-Orchestrator-8B, fine-tuned on our codebase patterns)
- CodeGraph --- AST-based code indexing for cross-file analysis
- AitherIdentity --- SAML IdP for GitHub SSO, same auth chain for human and agent access
Total new code written to enable this: about 400 lines of YAML workflow definitions. Everything else was already in the system --- we just pointed GitHub Actions at it.
What's Next
The obvious next step is closing the feedback loop. When Atlas requests changes on an agent PR, dispatch a second agent run with the review feedback injected into the task prompt. The agent fixes its own fix, Atlas re-reviews, and if it passes, the PR is ready for human approval.
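Injecting review feedback into the next task prompt is mostly string assembly. A minimal sketch of that step is below, assuming the requested changes are already available as a list of strings. The followup_task helper and the prompt wording are hypothetical, since this part of the loop hasn't been built yet.

```python
def followup_task(original_task: str, requested_changes: list[str]) -> str:
    """Append numbered review items to the original task description."""
    if not requested_changes:
        raise ValueError("no review feedback to act on")
    numbered = "\n".join(
        f"{index}. {change}"
        for index, change in enumerate(requested_changes, start=1)
    )
    return (
        f"{original_task}\n\n"
        "A reviewer requested the following changes to your previous attempt. "
        "Address every item:\n"
        f"{numbered}"
    )


print(
    followup_task(
        "Eliminate all hardcoded Ollama URLs in this file.",
        [
            "Apply the same URL fix to _trigger_orchestrator_preload()",
            "Use ollama_url() from AitherPorts in both functions",
            "Fix the test's relative file path",
        ],
    )
)
```

Keeping the original task at the top and the feedback as a numbered list makes it easy for the reviewer to check, item by item, whether the second attempt addressed each request.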
We're also expanding the bug hunter's vocabulary. Right now it catches hardcoded URLs, shell injection patterns, and documentation drift. We're adding: dead code detection, unused import cleanup, test coverage gaps, dependency version staleness, and configuration drift between services.yaml and actual Docker compose files.
The long-term vision is a codebase that maintains itself. Not writes itself --- the creative work, the architecture decisions, the product direction, that's human. But the maintenance grind? The port number that drifted. The URL that should use service discovery. The test that uses a relative path. The shell=True that's one $(curl) away from RCE.
That work should happen automatically, continuously, with quality review at every step.
We're not there yet. But as of Monday morning, we have a system that discovers its own bugs, writes its own fixes, and reviews its own code --- and the reviewer is honest enough to reject the fix when it's wrong.
That's a pretty good start.
The GitHub Actions workflows, agent dispatch system, and Atlas PR Guardian shown in this post are running in production on the AitherOS repository.