From Screenshot to Production React in Four Steps: The Iris Design-to-Code Pipeline
Every frontend team has the same bottleneck. A designer produces a mockup. A developer stares at it, mentally decomposes it into components, estimates spacing by eyeballing pixel distances, picks colors by zooming in and using an eyedropper, and then writes the code from scratch. The mockup is a picture. The code is a structure. The translation between them lives entirely in a human's head.
We automated that translation. Not with a prompt that says "build me a landing page" -- that produces generic slop. We automated it with computer vision, structured extraction, and agent-driven code generation. The result is a pipeline where you hand Iris an image of a UI and get back production React components that match it.
This post documents how the pipeline works, why each stage exists, and what it means for how we build interfaces.
The Pipeline
Four stages. Each one transforms the input into a progressively more structured representation until it becomes code.
Mockup Image
|
v
[Stage 1: OmniParser] --- YOLO detection + Florence captioning + OCR
|
v
[Stage 2: Vision Analysis] --- Layout, colors, typography, spacing
|
v
[Stage 3: LLM Synthesis] --- Structured component specification
|
v
[Stage 4: Demiurge Agent] --- Production React + Tailwind code
|
v
Production Components
One API call triggers the entire chain: POST /design-to-code on Iris (port 8786). Or via MCP: design_to_code(image_path, brief, framework, style_system). The caller gets back working code.
Stage 1: OmniParser -- Seeing What's There
The first problem is deceptively hard: given a screenshot, identify every interactive and visual element in it. Not "there's a button somewhere" but "there are 5 buttons, 3 text inputs, 12 text blocks, and 8 icons, and here are their bounding boxes, labels, and visual descriptions."
OmniParser combines three models to solve this:
YOLO object detection scans the image and draws bounding boxes around every UI element it recognizes. Buttons, inputs, cards, navigation bars, modals, dropdowns, icons. Each detection comes with a confidence score and pixel-precise coordinates. A typical landing page produces 30-60 detections.
Florence-2 captioning takes each detected region and generates a natural language description. A green button with a right arrow becomes "Submit button with arrow icon." A text field with a placeholder becomes "Email input field with placeholder text." These captions bridge the gap between pixel regions and semantic meaning.
OCR extraction pulls every visible text string from the image. Headlines, body copy, button labels, placeholder text, navigation items. This is not optional -- you cannot generate accurate code without knowing what the text actually says. A pricing card that says "$29/mo" needs that exact string in the output, not a placeholder.
The combined output from OmniParser is a structured inventory of everything visible in the mockup:
{
"elements": [
{"type": "button", "bbox": [420, 680, 580, 720], "caption": "Get Started button with gradient background", "text": "Get Started"},
{"type": "input", "bbox": [300, 620, 640, 660], "caption": "Email input field with placeholder", "text": "Enter your email"},
{"type": "text", "bbox": [200, 100, 740, 160], "caption": "Large heading text, white on dark background", "text": "Build Faster with AI"}
],
"total_detections": 47
}
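To make the combination concrete, here is a simplified Python sketch of how the three outputs could be merged into that inventory. The function and field names are illustrative, not OmniParser's actual API:

# Illustrative only: OmniParser's real interfaces differ. This assumes each
# sub-model has already run and returned plain Python structures.
def merge_omniparser_outputs(detections, captions, ocr_regions):
    """Combine YOLO boxes, Florence-2 captions, and OCR text into one inventory.

    detections:  list of {"type": str, "bbox": [x1, y1, x2, y2], "confidence": float}
    captions:    list of str, one per detection, in the same order
    ocr_regions: list of {"bbox": [x1, y1, x2, y2], "text": str}
    """
    def overlaps(a, b):
        # True if two boxes intersect at all.
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

    elements = []
    for det, caption in zip(detections, captions):
        # Attach any OCR text whose region intersects this detection.
        text = " ".join(r["text"] for r in ocr_regions if overlaps(det["bbox"], r["bbox"]))
        elements.append({"type": det["type"], "bbox": det["bbox"],
                         "caption": caption, "text": text})
    return {"elements": elements, "total_detections": len(elements)}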
This is the raw material. It tells us what exists. It does not yet tell us how it's organized.
Stage 2: Vision Analysis -- Understanding the Design
Stage 2 takes the element inventory from OmniParser and combines it with direct vision model analysis of the full image to extract four design dimensions:
Layout structure. How are the elements spatially organized? Is this a single-column layout or a grid? Where are the section breaks? What is the visual hierarchy? The vision model identifies patterns like "hero section spanning full width, followed by a three-column feature grid, followed by a pricing section with four cards, followed by a footer." This is not guesswork -- it is spatial reasoning over the bounding box coordinates combined with visual pattern recognition.
Color scheme. Every distinct color in the image is extracted with its hex value and semantic role. Background colors, text colors, accent colors, border colors, gradient stops. The vision model distinguishes between "this blue is used for primary actions" and "this blue is used for informational text" based on context. A typical extraction produces 8-15 distinct colors with role assignments:
{
"background": "#0f0f23",
"surface": "#1a1a2e",
"primary_accent": "#14b8a6",
"text_primary": "#f8fafc",
"text_secondary": "#94a3b8",
"border": "#334155"
}
Typography. Font families (detected from visual characteristics when not explicitly available), size hierarchy (heading sizes relative to body text), weight variations, line heights, letter spacing. The model extracts the typographic scale: if the heading is roughly 3x the body text size and the body text appears to be 16px, the heading is 48px and subheadings are likely 24px or 32px.
Spacing. Section padding, card internal padding, gap between grid items, margin patterns. These are derived from the bounding box positions of detected elements. If three cards are evenly spaced with 24px gaps between them and 32px internal padding, that pattern is captured.
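To illustrate that derivation, the sketch below measures vertical gaps directly from bounding boxes and snaps them to a conventional spacing scale. It is a simplification, not the analysis stage's actual code:

# Illustrative: infer gaps from element bounding boxes ([x1, y1, x2, y2])
# and snap them to a common spacing scale.
SPACING_SCALE = [4, 8, 12, 16, 24, 32, 48, 64, 96]  # px

def snap(value, scale=SPACING_SCALE):
    # Return the scale step closest to the measured value.
    return min(scale, key=lambda step: abs(step - value))

def vertical_gaps(bboxes):
    # Measure the gap between each element and the one below it.
    ordered = sorted(bboxes, key=lambda b: b[1])  # top-to-bottom by y1
    gaps = []
    for above, below in zip(ordered, ordered[1:]):
        gap = below[1] - above[3]  # next top edge minus previous bottom edge
        if gap > 0:
            gaps.append(snap(gap))
    return gaps

# Three evenly stacked cards with 24px of space between them yield [24, 24].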
The output of Stage 2 is a design analysis document -- not yet code, not yet a component tree, but a complete description of the visual design language used in the mockup.
Stage 3: LLM Synthesis -- The Component Specification
This is where the intelligence lives. Stages 1 and 2 produced raw observations -- elements, colors, spacing. Stage 3 synthesizes those observations into a structured component specification that a code generator can implement.
The LLM receives the OmniParser element inventory and the vision analysis, and produces a JSON specification:
{
"page_layout": "flex-col",
"color_scheme": {
"bg": "#0f0f23",
"surface": "#1a1a2e",
"accent": "#14b8a6",
"text": "#f8fafc",
"muted": "#94a3b8"
},
"typography": {
"font_family": "Inter",
"base_size": "16px",
"scale": [48, 32, 24, 20, 16, 14]
},
"components": [
{
"name": "HeroSection",
"type": "hero",
"props": {
"headline": "Build Faster with AI",
"subheadline": "The operating system for intelligent agents",
"cta_text": "Get Started",
"cta_style": "gradient"
},
"layout": "centered-stack",
"spacing": { "py": "6rem", "gap": "1.5rem" }
},
{
"name": "FeatureGrid",
"type": "section",
"children_count": 3,
"grid": "3-col",
"gap": "2rem"
},
{
"name": "PricingCard",
"type": "card",
"props": {
"title": "Pro",
"price": "$29/mo",
"features": ["Feature 1", "Feature 2", "Feature 3"],
"cta_text": "Start Free Trial"
},
"variants": ["default", "highlighted"],
"spacing": { "p": "2rem" }
}
],
"accessibility_requirements": [
"Color contrast ratio >= 4.5:1 for all text",
"All interactive elements must be keyboard accessible",
"Images must have alt text",
"Focus indicators must be visible"
]
}
The critical property of this specification is that it is implementation-agnostic but implementation-complete. It does not contain React code or Tailwind classes. It contains everything needed to write them. The component names, the prop interfaces, the layout relationships, the exact text content, the color values, the spacing values, the accessibility requirements.
This separation matters. The same specification can produce React + Tailwind, Vue + UnoCSS, Svelte + vanilla CSS, or any other framework combination. The design decisions are captured in the spec. The implementation decisions are deferred to Stage 4.
The LLM also makes architectural decisions that a pure vision system cannot: which elements should be separate components versus inline, which props should be configurable, whether a group of similar elements warrants a reusable component with variants. A row of four pricing cards with identical structure but different content should be one PricingCard component rendered four times, not four separate components. That decision requires understanding component design patterns, not just visual analysis.
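One practical consequence of keeping exact values in the spec: the accessibility requirements listed above can be checked mechanically against the extracted palette. The sketch below applies the standard WCAG relative-luminance formula to the 4.5:1 contrast requirement; it is an illustration, not part of the pipeline's code:

# Illustrative check of the ">= 4.5:1" contrast requirement using the
# standard WCAG 2.x relative-luminance formula.
def relative_luminance(hex_color):
    # Relative luminance of an sRGB color given as "#rrggbb".
    def channel(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (1, 3, 5))
    return 0.2126 * channel(r) + 0.7152 * channel(g) + 0.0722 * channel(b)

def contrast_ratio(fg, bg):
    # WCAG contrast ratio between two colors (always >= 1).
    lighter, darker = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)

# Primary text (#f8fafc) on the dark background (#0f0f23) from the spec
# comfortably clears 4.5:1.
assert contrast_ratio("#f8fafc", "#0f0f23") >= 4.5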
Stage 4: Demiurge -- Writing the Code
The component specification goes to Demiurge, AitherOS's code generation agent. This is not a template fill. Demiurge is a full agent with a ReAct loop -- it reasons about the specification, writes code, evaluates it, and iterates.
The dispatch is explicit:
ForgeSpec(
agent="demiurge",
effort_level=7,
intent_type="coding",
task="Implement this component spec as React + Tailwind",
context=component_spec_json
)
Effort level 7 puts this in the reasoning model tier. This is deliberate -- component generation from a spec is not a trivial task. The agent needs to make judgment calls about responsive breakpoints, animation choices, hover states, and edge cases that the spec does not explicitly cover.
Demiurge produces production-ready code. Not a prototype. Not a wireframe with TODO comments. Components with proper TypeScript types, responsive layouts, hover/focus states, dark mode support (when the design indicates it), semantic HTML, and ARIA attributes.
The output is structured: each component in its own file, a page component that composes them, and any shared utilities (like a color theme object) extracted into a constants file.
Why Four Stages Instead of One
The obvious question: why not just give a vision model the screenshot and say "write React code for this"?
Because that approach fails in three specific ways.
Precision. A single-shot vision-to-code model approximates colors, guesses at spacing, and invents text content. It sees "roughly teal" and picks #2dd4bf when the actual color is #14b8a6. It estimates "about 24px gap" when the gap is 32px. These small errors compound across a full page. OmniParser's OCR extracts exact text. The vision analysis extracts exact colors. The component spec preserves exact values.
Structure. A single-shot model produces a monolithic blob of JSX. No component decomposition, no prop interfaces, no reusability. The three extraction stages (elements, then design language, then component architecture) force structural decisions to be made explicitly before any code is written. The output is a component tree, not a page dump.
Controllability. With a single-shot approach, you get what you get. With the pipeline, you can intervene at any stage. Don't like the component decomposition? Edit the spec and re-run Stage 4. Want to change the color scheme? Modify the color values in the spec. Want Vue instead of React? Change the framework parameter. The intermediate representation (the component spec) is human-readable and human-editable.
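As a sketch of that intervention loop, assuming the spec has been saved as JSON and that Stage 4 is reachable through the ForgeSpec dispatch shown earlier (the import path and file location here are assumptions):

# Illustrative sketch of intervening at the intermediate representation.
import json

from agentforge import ForgeSpec  # assumed import path; the real module may differ

with open("component_spec.json") as f:  # hypothetical location of the saved spec
    spec = json.load(f)

# Swap the accent color and retarget the framework without re-running
# the detection or analysis stages.
spec["color_scheme"]["accent"] = "#f59e0b"

forge_spec = ForgeSpec(
    agent="demiurge",
    effort_level=7,
    intent_type="coding",
    task="Implement this component spec as Vue + Tailwind",
    context=json.dumps(spec),
)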
What OmniParser Actually Brings
OmniParser is the foundation that makes the rest of the pipeline reliable. Without it, the vision model is doing everything: detecting elements, reading text, analyzing layout, extracting colors, and making component decisions -- all in a single forward pass. That is too many objectives for one model to optimize simultaneously.
With OmniParser, the vision model receives a pre-annotated image. It knows where every button is, what every text block says, and what every icon looks like. It can focus entirely on the higher-order analysis: layout patterns, design language, visual hierarchy. The hard perceptual work is already done.
This is the same principle behind the cognitive pipeline in AitherOS's chat system. The 12-stage context pipeline does not ask one model to do everything. Each stage contributes a specific type of information -- identity, memory, knowledge, affect, web search -- and the final model receives all of it as assembled context. Separation of concerns is not just a code architecture principle. It is a model architecture principle.
The MCP Integration
The pipeline is exposed as an MCP tool, which means any AI assistant connected to AitherOS can invoke it:
design_to_code(
image_path="/path/to/mockup.png",
brief="Dark AI landing page with pricing cards and feature grid",
framework="react",
style_system="tailwind"
)
The brief parameter is optional but significantly improves results. It gives the LLM synthesis stage additional context about intent. "Dark AI landing page" tells the model that the dark color scheme is intentional (not an artifact of the screenshot), that the audience is technical, and that the tone should be professional-futuristic rather than corporate-conservative. These are design decisions that cannot be extracted from pixels alone.
The framework and style system parameters control Stage 4 output. Currently supported: React + Tailwind (the default and best-tested path), with Vue and Svelte as secondary targets.
The Endpoint
curl -X POST localhost:8786/design-to-code \
-F "image=@mockup.png" \
-F "brief=Dark AI landing page with pricing cards" \
-F "framework=react" \
-F "style_system=tailwind"
Iris (port 8786) handles the orchestration. It calls OmniParser for detection, runs the vision analysis, synthesizes the component spec, dispatches to Demiurge via AgentForge, and returns the generated code. The caller does not need to know about the internal stages.
Response time depends on mockup complexity. A simple landing page with 30-40 detected elements takes roughly 45-90 seconds end-to-end. A complex dashboard with 100+ elements takes 2-3 minutes. The bottleneck is Stage 4 (agent code generation), not the vision stages.
Where This Fits
This is not a replacement for design systems. If you have a mature component library with established patterns, you do not need a vision model to decompose your mockups -- your designers are already working within the component vocabulary.
Where this pipeline shines is the cold start. A new project. A new client. A competitor's landing page you want to study. A Dribbble concept you want to prototype. A hand-drawn whiteboard sketch photographed with a phone. Any situation where you have a visual reference and need working code, but do not have a design system to draw from.
It also shines for iteration speed. Take a screenshot of your current UI, hand it to the pipeline with a modified brief ("same layout but with a warmer color palette and rounded corners"), and get a variant in 90 seconds. Use it as a starting point, not a final product. The 80% that the pipeline gets right saves hours. The 20% you adjust by hand takes minutes.
The Flywheel Connection
Every design-to-code invocation generates training data. The input mockup, the OmniParser detections, the vision analysis, the component spec, and the final code -- all logged to Strata. Over time, this builds a dataset of (mockup, component_spec, code) triples that can be used to fine-tune the vision and synthesis stages.
The same flywheel that improves our orchestrator model applies here. Real usage generates real training data. The pipeline gets better at detecting UI patterns, more accurate at color extraction, more sophisticated in component decomposition. Each invocation is both a product and a training example.
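A single logged example might look roughly like this; the field names are illustrative, not Strata's actual schema:

# Illustrative shape of one logged training example (abridged values).
training_example = {
    "mockup_image": "mockups/landing_v2.png",                    # hypothetical input path
    "omniparser_inventory": {"total_detections": 47},            # stage 1 output
    "vision_analysis": {"background": "#0f0f23"},                 # stage 2 output
    "component_spec": {"page_layout": "flex-col"},                # stage 3 output
    "generated_files": ["HeroSection.tsx", "PricingCard.tsx"],    # stage 4 output
    "framework": "react",
    "style_system": "tailwind",
}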
Technical Details
OmniParser: YOLO-based detection model + Florence-2 captioning + Tesseract/EasyOCR. Runs on GPU. Detection takes 2-5 seconds depending on image resolution.
Vision Analysis: Routed through MicroScheduler to the active vision-capable model. Receives the original image plus OmniParser annotations. Produces structured JSON.
LLM Synthesis: Effort level 6 (orchestrator tier). Receives OmniParser output + vision analysis. Produces the component specification JSON. Typical generation: 800-2000 tokens.
Demiurge Dispatch: AgentForge with effort level 7 (reasoning tier). Full ReAct loop with tool access. Produces TypeScript/React files. Typical generation: 2000-8000 tokens depending on component count.
Iris Orchestration: FastAPI service at port 8786. Coordinates all four stages. Handles image upload, temporary storage, stage sequencing, and response assembly.
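A simplified sketch of that sequencing is below. The endpoint path and form fields match the documented API; the handler internals and stage helper functions are hypothetical:

# Simplified sketch of the stage sequencing inside Iris. The helpers
# (run_omniparser, run_vision_analysis, synthesize_component_spec,
# dispatch_to_demiurge) are hypothetical names, not the service's real code.
from fastapi import FastAPI, Form, UploadFile

app = FastAPI()

@app.post("/design-to-code")
async def design_to_code(
    image: UploadFile,
    brief: str = Form(""),
    framework: str = Form("react"),
    style_system: str = Form("tailwind"),
):
    image_bytes = await image.read()
    inventory = run_omniparser(image_bytes)                        # stage 1
    analysis = run_vision_analysis(image_bytes, inventory)         # stage 2
    spec = synthesize_component_spec(inventory, analysis, brief)   # stage 3
    files = dispatch_to_demiurge(spec, framework, style_system)    # stage 4
    return {"component_spec": spec, "files": files}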
What This Is Not
This is not "AI replacing designers." The pipeline cannot make design decisions. It cannot decide whether a landing page should have a hero section or lead with social proof. It cannot choose a color palette that communicates trust versus excitement. It cannot determine information hierarchy for a new product.
What it can do is eliminate the mechanical translation step. Once a design decision is made and expressed as a visual -- whether a polished Figma mockup or a rough sketch -- the pipeline handles the conversion to code. The creative work stays human. The mechanical work gets automated.
That is the correct division of labor.