
OmniParser Integration: Teaching AitherOS to See and Understand Every Screen

March 8, 2026 · 9 min read · terra

AitherOS agents have had vision capabilities for a while — our screen awareness loop captures what's on screen and sends it through a vision model that returns natural language descriptions. "The user has VS Code open with a Python file" is useful context, but it's not enough for an agent that needs to act on the screen.

What if an agent could see every button, every text field, every clickable element — with pixel coordinates?

That's what OmniParser gives us.

The Problem: Vision vs. Structure

Our existing screen awareness pipeline in AitherDesktop works well for understanding what the user is doing. The Vision model (via the LLM scheduling system) analyzes screenshots and produces descriptions like:

"The user is editing a FastAPI router file in VS Code. There's a terminal panel at the bottom showing pytest output with 3 failures."

Great for context. Terrible for action. If an agent needs to click the "Run" button or type into a search box, it needs coordinates — not prose.

This is the fundamental gap between perception (what's on screen) and interaction (what can I click). OmniParser bridges that gap.

What is OmniParser?

OmniParser V2 is Microsoft's open-source screen parsing toolkit. It combines two models:

  1. YOLO-based icon detector — finds UI elements (buttons, inputs, checkboxes, icons) with bounding boxes
  2. Florence-2 captioner — generates natural language descriptions for each detected element

The output is a structured list of UI elements with labels, types, bounding boxes, confidence scores, and captions. Exactly what an agent needs to navigate a GUI.

We forked it to Aitherium/OmniParser to maintain compatibility with our model paths and add a few AitherOS-specific patches.

Architecture: Service + Client + Desktop Integration

We didn't just bolt OmniParser onto a script. We built it as a first-class AitherOS microservice following the same patterns as our other 203 services.

The Service: AitherOmniParser

The service wraps OmniParser's YOLO + Florence pipeline in a FastAPI application using the standard AitherOS service bootstrap pattern. The port is resolved dynamically from the service registry — no hardcoded values.

Key design decisions:

  • Lazy model loading — models aren't loaded at boot. They load on first request or via an explicit /load_models call. This matters because OmniParser's YOLO + Florence models consume ~2-4GB of VRAM.
  • VRAM coordination — before loading models, the service acquires a VRAM slot from the LLM scheduler. This prevents OmniParser from competing with active LLM inference or ComfyUI generations.
  • Element classification — raw YOLO detections get classified into semantic types (button, input, checkbox, icon, text, image, link, menu) based on caption content and aspect ratios.
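The caption-plus-aspect-ratio classification described above can be sketched as a small heuristic. The service's actual rules aren't shown in this post, so the keywords and thresholds below are illustrative assumptions, not the real implementation:

```python
def classify_element(caption: str, bbox: tuple) -> str:
    """Map a raw YOLO detection to a semantic type using its Florence
    caption and the aspect ratio of its bounding box (hypothetical rules)."""
    x1, y1, x2, y2 = bbox
    aspect = max(x2 - x1, 1e-6) / max(y2 - y1, 1e-6)
    text = caption.lower()

    # Caption keywords take priority over shape.
    keyword_types = [
        ("checkbox", ("checkbox", "check box", "toggle")),
        ("button", ("button", "submit", "close")),
        ("input", ("input", "search box", "text field", "textbox")),
        ("link", ("link", "hyperlink")),
        ("menu", ("menu", "dropdown")),
        ("icon", ("icon", "logo")),
        ("image", ("image", "picture", "photo")),
    ]
    for etype, words in keyword_types:
        if any(word in text for word in words):
            return etype

    # Fall back on shape: near-square boxes look like icons,
    # wide shallow boxes look like text labels.
    if 0.8 <= aspect <= 1.25:
        return "icon"
    if aspect > 4.0:
        return "text"
    return "button"
```

A keyword-first, shape-second ordering like this keeps misclassification cheap: a wrong type still leaves the element clickable via its coordinates.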

The endpoints:

Endpoint                Purpose
POST /parse             Parse a screenshot from a file path
POST /parse/upload      Parse an uploaded image
POST /detect            Detection only (no captioning)
POST /find_element      Find a specific element by description
POST /load_models       Pre-load models into VRAM
POST /unload_models     Release VRAM
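For callers that skip the client library, hitting /parse directly is straightforward. The request and response field names below (image_path, elements, interactable) are assumptions inferred from the client API, not a published schema:

```python
import json
import urllib.request

def parse_screenshot_raw(base_url: str, image_path: str) -> dict:
    """POST a screenshot path to the OmniParser service, return raw JSON.

    base_url is the dynamically resolved service address, e.g. from the
    service registry -- never a hardcoded port.
    """
    payload = json.dumps({"image_path": image_path}).encode()
    req = urllib.request.Request(
        f"{base_url}/parse", data=payload,
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

def interactable_elements(result: dict) -> list:
    """Filter a parse result down to elements an agent can act on."""
    return [e for e in result.get("elements", []) if e.get("interactable")]
```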

The Client Library

A dedicated client with circuit breaker protection, automatic retries, and typed response objects. Agents interact with it through a simple API:

client = get_omniparser_client()
result = await client.parse_screenshot("/path/to/screenshot.png")

# Structured access
for elem in result.interactable:
    print(f"{elem.label} at ({elem.center_x}, {elem.center_y})")

# Convenience filters
buttons = result.buttons       # Just buttons
inputs = result.inputs         # Just text fields
text = result.text_elements    # Just text labels

The typed response objects give agents clean access to UI elements without parsing raw JSON.
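When a parse result is already in hand, a "find element by description" lookup can also be approximated locally instead of round-tripping through /find_element. This best-match helper is a hypothetical sketch, not the client's actual implementation:

```python
from difflib import SequenceMatcher

def find_best_match(elements: list, description: str):
    """Return the element whose label/caption best matches the description.

    Uses a simple string-similarity ratio; the real service likely uses
    the Florence captions with a smarter matcher.
    """
    def score(elem):
        text = f"{elem['label']} {elem.get('caption', '')}".lower()
        return SequenceMatcher(None, text, description.lower()).ratio()
    return max(elements, key=score, default=None)
```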

Service Registration

OmniParser is registered in the central service registry like every other AitherOS service — specifying its group (perception), GPU requirement, and description. Boot order, port resolution, health checks, and Docker networking all derive from this single registration.

The Integration Point: AitherDesktop Screen Awareness

The real power isn't the service in isolation — it's the integration with AitherDesktop's screen awareness loop.

Before: Vision Only

Screen Capture → Diff Check → Vision Model → NL Description → Agent Context

The existing ScreenAwarenessLoop captures screenshots at configurable intervals (default 10s), checks if the screen has changed enough (30% pixel diff threshold), and sends changed frames to a vision model for analysis. The result is a ScreenContext object with a natural language vision_summary.
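The change gate above can be sketched as a per-pixel comparison against the 30% threshold. The loop's actual diff metric isn't shown in this post, so treat this as an illustrative assumption:

```python
def screen_changed(prev: list, curr: list, threshold: float = 0.30) -> bool:
    """Return True when more than `threshold` of the pixels differ.

    prev/curr are flat pixel sequences; a resolution change or a first
    frame always counts as changed.
    """
    if len(prev) != len(curr) or not prev:
        return True
    changed = sum(1 for a, b in zip(prev, curr) if a != b)
    return changed / len(prev) > threshold
```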

After: Vision + OmniParser

Screen Capture → Diff Check → ┬→ Vision Model    → NL Description  ─┐
                               └→ OmniParser      → UI Elements     ─┤
                                                                      └→ Merged ScreenContext

Now both run in parallel during the slow-path analysis. Vision gives the agent understanding. OmniParser gives the agent targets.
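The parallel slow path can be sketched with asyncio.gather. The ScreenContext shape here and the analyzer callables are hypothetical stand-ins for the AitherDesktop internals, but the error handling mirrors the enrichment design: a failed OmniParser call must never cost the agent its vision summary.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class ScreenContext:
    vision_summary: str = ""
    ui_elements: list = field(default_factory=list)

async def analyze_frame(frame, vision, omniparser) -> ScreenContext:
    # return_exceptions=True keeps the vision result even if the
    # OmniParser call raises -- enrichment, not replacement.
    summary, elements = await asyncio.gather(
        vision(frame), omniparser(frame), return_exceptions=True)
    if isinstance(summary, Exception):
        summary = ""
    if isinstance(elements, Exception):
        elements = []
    return ScreenContext(vision_summary=summary, ui_elements=elements)
```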

The UIElement Dataclass

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class UIElement:
    id: str              # Unique identifier
    label: str           # Human-readable label
    element_type: str    # button, input, checkbox, icon, text, etc.
    bbox: Tuple[float, float, float, float]  # (x1, y1, x2, y2) normalized
    center_px: Tuple[int, int]               # Pixel coordinates for clicking
    confidence: float    # Detection confidence
    caption: str         # Florence caption
    ocr_text: str        # Extracted text
    interactable: bool   # Can this element be clicked/typed into?

The center_px field is the critical one for GUI agents — it's the exact pixel coordinate where a click would hit the element. No more coordinate math from bounding boxes.
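For context, this is all center_px precomputes: mapping a normalized (x1, y1, x2, y2) box to the pixel midpoint for a given capture resolution. The screen dimensions below are example values:

```python
def center_px(bbox: tuple, screen_w: int, screen_h: int) -> tuple:
    """Convert a normalized (x1, y1, x2, y2) box to its pixel center."""
    x1, y1, x2, y2 = bbox
    return (round((x1 + x2) / 2 * screen_w),
            round((y1 + y2) / 2 * screen_h))
```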

Rate Limiting and Circuit Breaking

OmniParser is heavier than a simple screenshot diff. We apply several guards:

  • Minimum interval — configurable via coworker.omniparser.min_interval (default 30s). OmniParser won't run more often than this even if the screen changes rapidly.
  • Circuit breaker — if OmniParser fails 3 times consecutively (service down, VRAM unavailable, timeout), it backs off automatically and stops calling until the service recovers.
  • Enrichment mode — by default, OmniParser runs alongside Vision (enrich_vision: true), not instead of it. If OmniParser fails, agents still get vision context.
  • Element cap — max_elements: 50 prevents flooding agent context with hundreds of tiny UI elements from complex screens.
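The minimum-interval and circuit-breaker guards combine into a small piece of state. This is a minimal sketch under assumed semantics (consecutive-failure counting, reset on success), not the loop's actual code:

```python
import time

class OmniParserGuard:
    """Gate OmniParser calls behind a minimum interval and a breaker."""

    def __init__(self, min_interval: float = 30.0, max_failures: int = 3):
        self.min_interval = min_interval
        self.max_failures = max_failures
        self.last_run = 0.0
        self.failures = 0

    def should_run(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        if self.failures >= self.max_failures:
            return False  # circuit open: stop calling until it recovers
        return now - self.last_run >= self.min_interval

    def record(self, ok: bool, now=None) -> None:
        self.last_run = time.monotonic() if now is None else now
        self.failures = 0 if ok else self.failures + 1
```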

Configuration

All OmniParser settings live in AitherDesktop's config under coworker.omniparser:

coworker:
  omniparser:
    enabled: true
    min_interval: 30.0        # Seconds between OmniParser runs
    box_threshold: 0.05       # YOLO confidence threshold
    include_captions: true    # Florence icon captioning
    include_ocr: true         # OCR text extraction
    enrich_vision: true       # Run alongside Vision, not instead of
    max_elements: 50          # Cap elements per parse

Automated Setup: AitherZero Script

Setting up OmniParser involves Python dependencies (torch with CUDA, ultralytics, transformers, easyocr), model downloads (~2GB from HuggingFace), and environment configuration. We automated the entire pipeline in a single AitherZero script:

AitherZero/library/automation-scripts/50-ai-setup/5004_Setup-OmniParser.ps1

Run it:

pwsh -File ./AitherZero/library/automation-scripts/50-ai-setup/5004_Setup-OmniParser.ps1

The script handles 8 steps:

  1. Python check — verifies Python 3.10+ is available
  2. Conda environment (optional) — creates or activates omniparser conda env with -UseConda
  3. Dependencies — installs torch+CUDA, ultralytics, transformers, easyocr, Pillow, huggingface-cli
  4. GPU/CUDA check — detects GPU availability and CUDA version
  5. Model download — 3-tier fallback: HuggingFace CLI → python -m huggingface_hub → git clone
  6. Verification — confirms all model files are present and correct
  7. Environment config — sets OMNIPARSER_WEIGHTS_DIR and writes .env file
  8. Health check — starts the service and verifies /health responds

Parameters for flexibility:

# Skip model download (already have them)
./5004_Setup-OmniParser.ps1 -SkipModels

# Custom weights directory
./5004_Setup-OmniParser.ps1 -WeightsDir "D:\Models\OmniParser"

# Force reinstall everything
./5004_Setup-OmniParser.ps1 -Force

# Clone our fork for development
./5004_Setup-OmniParser.ps1 -CloneFork

VRAM Coordination: Playing Nice with LLMs

The trickiest part of this integration was VRAM management. On a consumer GPU (8-24GB), OmniParser's models compete with:

  • Active LLM inference (orchestrator, reasoning, coding models)
  • ComfyUI image generation
  • Embedding model for vector search
  • Voice models (whisper, piper)

We solved this the same way we solve it for everything else — VRAM slots managed by the LLM scheduling system. Before loading models, the service requests a VRAM allocation (approximately 3GB). If granted, it proceeds. If the scheduler is unreachable, it loads anyway (fail-open design).

The fail-open design is intentional. If the scheduler is down, we'd rather have OmniParser work (and risk VRAM pressure) than silently disable screen parsing. The worst case is an OOM that the OS handles; the common case is that it works fine because the user isn't running everything simultaneously.
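The decision boils down to one branch. The scheduler client below (request_vram_slot) is a hypothetical stand-in; only the fail-open shape comes from the post:

```python
def acquire_vram_or_fail_open(request_vram_slot, gb: float = 3.0) -> bool:
    """Try to reserve VRAM via the scheduler; load anyway if unreachable.

    Returns True when models should be loaded, False only when the
    scheduler explicitly denied the allocation.
    """
    try:
        granted = request_vram_slot(gb)
    except ConnectionError:
        return True  # fail open: scheduler down, proceed and risk pressure
    return bool(granted)
```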

What Agents Can Do Now

With structured UI elements in their context, agents can:

  • Navigate GUIs — "Click the Save button" maps to a specific UIElement with center_px coordinates
  • Fill forms — identify input fields by label and type into them
  • Read screen state — know exactly what checkboxes are checked, what menu items are available
  • Provide better assistance — instead of "I see VS Code", an agent knows there are 3 open tabs, a search box with text "TODO", and a Run button in the toolbar

The ScreenContext now carries both:

  • vision_summary: "The user is debugging a test failure in pytest"
  • ui_elements: 23 structured elements with coordinates, types, and labels

Agents get the why from Vision and the what/where from OmniParser.

Performance

On an RTX 3080 (10GB VRAM):

Operation                           Time
YOLO detection                      ~50-100ms
Florence captioning (20 elements)   ~200-400ms
Full parse with OCR                 ~500-800ms
Total with VRAM acquire             ~600-900ms

Sub-second parsing means OmniParser can run on every significant screen change without feeling sluggish. Combined with the 30-second minimum interval, VRAM impact is minimal.

What's Next

This integration lays the groundwork for full GUI automation agents. The immediate next steps:

  • Action execution — wiring UIElement coordinates to pyautogui/Win32 input injection
  • Element tracking — correlating elements across frames to detect state changes (button enabled → disabled)
  • Task recording — watching a user perform a task and learning the sequence of UI interactions
  • Cross-platform — extending OmniParser support to Linux (X11/Wayland) and macOS screen capture

The gap between "AI that watches your screen" and "AI that uses your screen" is shrinking. OmniParser gives AitherOS agents the structured understanding they need to cross it.


The OmniParser service is available in AitherOS v0.2.1+. Run 5004_Setup-OmniParser.ps1 to set it up, or enable it in AitherDesktop settings under Coworker > OmniParser.
