Tags: multimodal embeddings, CLIP, ComfyUI, MediaGraph, local-first, Google Gemini

Google Charges for Multimodal Embeddings — We Built It Free with a GPU You Already Own

March 11, 2026 · 5 min read · AitherOS

Google just dropped Gemini Embedding 2 — their new multimodal embedding API that maps text, images, and documents into a unified vector space. It's genuinely impressive technology. It's also a cloud API that costs money per request, sends your data to Google's servers, and stops working when your internet goes down.

We built the same thing. It runs on your local GPU. It costs nothing. And it autonomously indexes every media file in your workspace while you sleep.

What Are Multimodal Embeddings?

Traditional search is locked to a single modality — text queries find text, image queries find images. Multimodal embeddings break that wall. They map everything — images, documents, audio, video frames — into a single mathematical space where similarity is measured by distance.

Ask "sunset over mountains" and get back photographs, paintings, video frames, and documents that match — regardless of what format they're in. The embedding doesn't care if it's a JPEG or a paragraph. It understands meaning across modalities.
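"Similarity is distance" concretely means comparing vectors with a metric like cosine similarity. A minimal sketch, with toy 4-dimensional vectors standing in for real 768-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, ~0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins: in a shared embedding space, a text query and a matching
# photo land close together regardless of their original format.
query_text  = np.array([0.9, 0.1, 0.0, 0.1])  # "sunset over mountains"
photo       = np.array([0.8, 0.2, 0.1, 0.1])  # a matching JPEG
spreadsheet = np.array([0.0, 0.1, 0.9, 0.3])  # an unrelated document

assert cosine_similarity(query_text, photo) > cosine_similarity(query_text, spreadsheet)
```

The query never needs to know the candidate's format — only its position in the space.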

Google's Gemini Embedding 2 does this with a 3072-dimensional vector space. Ours does it in 768 dimensions using CLIP ViT-L/14. Both work. Ours is free.

The CLIP Trick: Your Image Generator Already Loaded It

Here's the insight that makes this zero-cost: if you're running ComfyUI, Stable Diffusion, or any modern image generation pipeline, CLIP is already loaded in your VRAM.

CLIP (Contrastive Language-Image Pretraining) is the model that understands your text prompts during image generation. It's sitting there in GPU memory, doing nothing between generations. We just... ask it to also produce embeddings.

AitherOS routes embedding requests through our Canvas service — the same service that manages ComfyUI. When you ask for a multimodal embedding:

  1. Canvas checks if CLIP is loaded (it almost always is)
  2. Feeds the image or text through CLIP's encoder
  3. Returns a 768-dimensional embedding vector
  4. Total additional VRAM cost: zero

No new model download. No new GPU allocation. No API key. No billing dashboard.

The Four-Backend Fallback Chain

We don't just rely on ComfyUI being available. The system has a four-backend priority chain:

1. ComfyUI-hosted CLIP (default) — Routes through the Canvas service. Shares VRAM with your image generation pipeline. Zero marginal cost.

2. Standalone CLIP — If ComfyUI isn't running, loads CLIP ViT-L/14 via sentence-transformers directly. Still local, still free, just uses its own VRAM allocation.

3. Text Fallback — For audio or edge cases where CLIP can't help, captions the media via a vision model, then embeds the caption as text. Lossy but functional.

4. Gemini Embedding 2 (opt-in) — Google's cloud API is available as an optional backend. We truncate its 3072-dim output to 768 via Matryoshka Representation Learning. But it's off by default — you have to explicitly enable it.
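The truncation step in backend 4 is simple because Matryoshka-trained embeddings pack the coarsest information into the leading dimensions: keep the first 768 components and re-normalize. A sketch, not Google's actual client code:

```python
import numpy as np

def truncate_matryoshka(vec: np.ndarray, dim: int = 768) -> np.ndarray:
    """Keep the leading `dim` components of a Matryoshka embedding,
    then re-normalize to unit length so cosine comparisons stay valid."""
    head = vec[:dim]
    return head / np.linalg.norm(head)

full = np.random.default_rng(0).normal(size=3072)  # stand-in for a 3072-dim Gemini vector
small = truncate_matryoshka(full)
assert small.shape == (768,)
```

The re-normalization matters: without it, truncated vectors would have systematically smaller norms than native 768-dim CLIP vectors in the same index.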

The chain fails over silently. If ComfyUI is down, you get standalone CLIP. If your GPU is busy, you get text fallback. Your search never breaks.
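The failover logic amounts to trying backends in priority order and swallowing failures until one succeeds. A minimal sketch — the backend functions here are hypothetical stand-ins that simulate failures, not the real AitherOS implementations:

```python
def comfyui_clip(item):
    raise ConnectionError("ComfyUI is not running")   # backend 1 down

def standalone_clip(item):
    raise MemoryError("GPU busy")                     # backend 2 unavailable

def text_fallback(item):
    return [0.1] * 768                                # caption-then-embed stand-in

def embed_with_fallback(item, backends):
    """Try each backend in priority order; return (backend_name, vector)."""
    last_error = None
    for backend in backends:
        try:
            return backend.__name__, backend(item)
        except Exception as err:
            last_error = err  # silent failover: move on to the next backend
    raise RuntimeError("all embedding backends failed") from last_error

name, vec = embed_with_fallback("clip.wav", [comfyui_clip, standalone_clip, text_fallback])
assert name == "text_fallback" and len(vec) == 768
```

The caller never sees which backend answered unless it asks — which is exactly why search keeps working when ComfyUI is down.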

MediaGraph: Autonomous Indexing You Don't Configure

Here's where it gets interesting. Google gives you an API endpoint. You call it, you get vectors back, you figure out storage yourself. We built a system that does the work for you.

MediaGraph is a faculty graph — the same architecture we use for CodeGraph (which autonomously indexes your codebase for semantic code search). It runs as a background task in the agent kernel's 30-second tick cycle:

  • First boot: Scans your entire workspace for media files (images, audio, video, documents — 13 supported types)
  • Every hour: Incremental refresh — checks file modification times, only re-embeds what changed
  • Always: Maintains a searchable graph of every media file with its embedding, modality, hash, and metadata

You don't configure it. You don't trigger it. You don't even know it's running. Your media files just become searchable.

The graph persists to disk (pickle serialization) and syncs across the system via GraphSyncBus — the same infrastructure that keeps code graphs, knowledge graphs, and memory graphs consistent.
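The real MediaGraph runs inside the agent kernel, but its incremental-refresh core — compare modification times, re-embed only what changed, persist via pickle — can be sketched as follows. The extension set and index path are illustrative, not the actual configuration:

```python
import pickle
from pathlib import Path

# Illustrative subset of the 13 supported media types.
MEDIA_EXTS = {".jpg", ".png", ".gif", ".mp3", ".wav", ".mp4", ".pdf"}
INDEX_PATH = Path("mediagraph.pkl")  # hypothetical on-disk location

def load_index() -> dict:
    return pickle.loads(INDEX_PATH.read_bytes()) if INDEX_PATH.exists() else {}

def refresh(workspace: Path, embed) -> dict:
    """One incremental pass: re-embed only files whose mtime changed."""
    index = load_index()
    for path in workspace.rglob("*"):
        if not path.is_file() or path.suffix.lower() not in MEDIA_EXTS:
            continue
        mtime = path.stat().st_mtime
        entry = index.get(str(path))
        if entry is None or entry["mtime"] != mtime:
            index[str(path)] = {"mtime": mtime, "embedding": embed(path)}
    INDEX_PATH.write_bytes(pickle.dumps(index))  # pickle persistence, as in MediaGraph
    return index
```

First boot is just this same pass against an empty index; the hourly tick re-runs it and touches only modified files.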

Five Tools Your Agents Already Have

The moment MediaGraph is active, every agent in the system gains five new capabilities via MCP tools:

| Tool | What It Does |
| --- | --- |
| multimodal_search | "Find images similar to this description" — cross-modal text-to-media search |
| multimodal_embed | Embed a specific file on demand |
| multimodal_index | Trigger indexing of a directory |
| multimodal_similar | "Find files that look like this one" — media-to-media similarity |
| multimodal_stats | Index health, backend status, file counts by modality |
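Over MCP, invoking one of these tools is a standard JSON-RPC tools/call request. The argument names below ("query", "top_k") are illustrative guesses, not the tool's actual schema:

```python
import json

# An MCP tools/call request (JSON-RPC 2.0) targeting the search tool.
# Argument names are assumptions — consult the tool's published schema.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "multimodal_search",
        "arguments": {"query": "images of database schemas", "top_k": 5},
    },
}
payload = json.dumps(request)
```

Because it's plain MCP, any agent framework that speaks the protocol gets these capabilities without custom glue.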

An agent reviewing a pull request can now search for "does this codebase have any images of database schemas?" and get actual results. An agent writing documentation can find screenshots that match the feature being described. An agent doing security review can search for "images containing credentials or API keys."

Cross-modal search means the query modality doesn't have to match the result modality. Text finds images. Images find similar images. Everything lives in the same 768-dimensional space.
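A brute-force version of that cross-modal lookup: rank every indexed file, whatever its modality, by cosine similarity to the query embedding. Toy 3-dimensional vectors stand in for real 768-dimensional ones:

```python
import numpy as np

def search(query_vec: np.ndarray, index: dict, top_k: int = 3):
    """Rank all indexed files (any modality) by cosine similarity to the query."""
    def sim(v):
        return float(np.dot(query_vec, v) / (np.linalg.norm(query_vec) * np.linalg.norm(v)))
    return sorted(index.items(), key=lambda kv: sim(kv[1]), reverse=True)[:top_k]

# Mixed-modality index: an image, a document, and an audio file share one space.
index = {
    "schema.png": np.array([1.0, 0.0, 0.0]),
    "notes.md":   np.array([0.9, 0.1, 0.0]),
    "song.mp3":   np.array([0.0, 0.0, 1.0]),
}
query = np.array([1.0, 0.05, 0.0])  # embedding of "database schema diagram"
results = search(query, index, top_k=2)
assert [name for name, _ in results] == ["schema.png", "notes.md"]
```

At MediaGraph's scale a linear scan like this is fine; larger indexes would swap in an approximate-nearest-neighbor structure without changing the interface.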

The Numbers

|  | Google Gemini Embedding 2 | AitherOS MediaGraph |
| --- | --- | --- |
| Cost | Per-request API pricing | Free (local GPU) |
| Data location | Google Cloud | Your machine |
| Internet required | Yes | No |
| Dimensions | 3072 | 768 (CLIP-native) |
| Autonomous indexing | No (API only) | Yes (background, hourly refresh) |
| Agent integration | Build it yourself | 5 MCP tools, pre-wired |
| VRAM overhead | N/A (cloud) | Zero (reuses ComfyUI's CLIP) |
| Offline capable | No | Yes |
| Fallback chain | None | 4 backends, silent failover |

The Philosophy

Google's approach: "Send us your data, we'll embed it, pay per request."

Our approach: "Your GPU already has the model loaded. Use it."

This isn't about Google's technology being bad — Gemini Embedding 2 is excellent. It's about the principle that if you own the hardware, you shouldn't rent the capability. CLIP ViT-L/14 produces perfectly good multimodal embeddings. Your ComfyUI instance already loaded it. The only missing piece was the plumbing to expose those embeddings as a searchable index.

We built the plumbing. MediaGraph handles the indexing. Five MCP tools handle the agent integration. The fallback chain handles reliability. And your data never leaves your machine.

The future of AI isn't paying per embedding. It's owning the stack.
