Your Image Generator Already Loaded CLIP — Why Not Use It for Search?
Google dropped Gemini Embedding 2 last week — a natively multimodal embedding model that handles text, images, video, audio, and documents in a single embedding space. The demos were impressive. Type "sunset over ocean" and find matching photos without any text metadata. Embed a product image and find similar items. Cross-modal search that actually works.
We wanted that capability in AitherOS. But we didn't want to send every image we process to Google's API to get it.
The Observation That Changed the Architecture
AitherOS already runs ComfyUI for image generation. Every time we generate an image — txt2img, img2img, style transfer — ComfyUI loads a CLIP model. CLIP (Contrastive Language-Image Pre-training) is the component that understands what text means visually. It's why you can type "cyberpunk city at night" and get a matching image.
CLIP doesn't just encode text for diffusion. It maps text and images into the same 768-dimensional vector space. Two things that are semantically similar — say, the text "a golden retriever" and a photo of one — land near each other in that space.
That's exactly what a multimodal embedding engine needs.
So the question became: ComfyUI already has a CLIP model loaded on our GPU. Why load a second copy for embeddings?
What We Built
A unified multimodal embedding engine that can embed images, text, video frames, audio, and documents into a shared 768-dimensional space — with ComfyUI's existing CLIP infrastructure as the primary backend.
```
Raw media (bytes/path)
  → MultimodalEmbeddingEngine.embed_media()
  → Backend chain: ComfyUI CLIP → standalone CLIP → text fallback
  → Returns: 768-dim normalized vector + modality + backend + media hash
  → Store in Nexus vector DB → cross-modal search
```
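The object coming out of that flow can be sketched as a small dataclass. The field set is taken directly from the flow above; the exact class in AitherOS may differ:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MultimodalEmbedding:
    """Illustrative shape of the engine's return value."""
    vector: List[float]   # 768-dim, L2-normalized
    modality: str         # "image", "text", "video", "audio", "document"
    backend: str          # which backend served it: "comfyui", "clip", "gemini", "text_fallback"
    media_hash: str       # truncated SHA-256 of the raw media bytes
```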
The backend chain is:

1. ComfyUI-hosted CLIP — Routes through AitherCanvas (our ComfyUI wrapper service at port 8108). Canvas manages the CLIP model alongside ComfyUI's VRAM budget. Same GPU, coordinated loading.
2. Standalone CLIP ViT-L/14 — If Canvas isn't running, the engine loads CLIP directly via sentence-transformers. Same model, same embedding space, just loaded separately.
3. Text fallback — If neither CLIP backend is available (no GPU, no VRAM), we caption the media through our Vision service, then embed the caption through nomic-embed-text. Lossy but functional.
Gemini Embedding 2 is available as an opt-in fourth backend. Add gemini to the priority list in embeddings.yaml if you want cloud multimodal. We just don't make it the default.
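Stripped of backend-specific details, the priority chain amounts to a try-in-order loop. A minimal sketch, where `embed_with_fallback` and the per-backend callables are hypothetical names, not the engine's actual API:

```python
from typing import Callable, List, Optional, Tuple

class EmbedResult:
    """Holds the vector plus which backend produced it."""
    def __init__(self, vector, backend: str):
        self.vector = vector
        self.backend = backend

def embed_with_fallback(
    data: bytes,
    backends: List[Tuple[str, Callable[[bytes], Optional[list]]]],
) -> Optional[EmbedResult]:
    """Try each (name, fn) backend in priority order; first success wins.

    A failing backend is simply skipped — no errors, no retries — which is
    how the engine falls from ComfyUI CLIP to standalone CLIP to the text
    fallback.
    """
    for name, fn in backends:
        try:
            vec = fn(data)
            if vec is not None:
                return EmbedResult(vec, name)
        except Exception:
            continue  # backend unavailable; fall through to the next one
    return None
```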
The Canvas Integration
AitherCanvas already wraps ComfyUI — starting/stopping the subprocess, health monitoring, VRAM slot gating through MicroScheduler, safety filtering, model management. Adding a CLIP embedding endpoint was surgical:
```python
@app.post("/embed/clip")
async def embed_clip(request: Dict[str, Any]):
    model = await _get_clip_embed_model()  # Lazy singleton, VRAM-checked
    text = request.get("text")
    if text:
        vec = model.encode(text, normalize_embeddings=True)
    else:
        img = Image.open(io.BytesIO(b64decode(request["image_base64"]))).convert("RGB")
        vec = model.encode(img, normalize_embeddings=True)
    # MRL truncation + L2 normalize to 768-dim
    return {"embedding": vec.tolist(), "model": "clip-ViT-L-14", "dimensions": 768}
```
The CLIP model is loaded once (lazy, asyncio-lock protected) and stays resident. When ComfyUI needs VRAM for a generation job, MicroScheduler coordinates — same as it does for vLLM and Vision models.
On the engine side, the ComfyUI backend is a simple HTTP call:
```python
async def _embed_comfyui(self, data, modality, mime_type, media_hash):
    canvas_url = get_canvas_client()._get_canvas_service_url()
    payload = {"image_base64": b64encode(data).decode("ascii"), "dimensions": 768}
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{canvas_url}/embed/clip", json=payload)
    vec = self._normalize_dimensions(resp.json()["embedding"])
    return MultimodalEmbedding(vector=vec, backend="comfyui", ...)
```
If Canvas is down, the engine marks comfyui as unavailable and falls through to standalone CLIP. No errors, no retries, just the next backend. The user never knows which backend served the embedding — they all produce the same 768-dim normalized vector.
Dimension Normalization
Not all CLIP variants output 768 dimensions. Gemini Embedding 2 outputs 3072. Future SigLIP models may output different sizes. We standardize everything via Matryoshka Representation Learning (MRL) truncation:
- Take the first 768 dimensions (MRL: early dimensions encode the most information)
- L2-normalize the truncated vector
This means you can switch backends without re-embedding your entire collection. A vector from Gemini and a vector from CLIP are both valid 768-dim points in comparable (though not identical) spaces.
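The MRL step is small enough to show in full. A sketch with NumPy — `normalize_dimensions` is an illustrative name, not necessarily the engine's actual method:

```python
import numpy as np

def normalize_dimensions(vec, target_dim: int = 768) -> np.ndarray:
    """MRL-style truncation: keep the first target_dim dims, then L2-normalize.

    Matryoshka-trained models pack the most information into the early
    dimensions, so truncation preserves most of the signal.
    """
    v = np.asarray(vec, dtype=np.float32)[:target_dim]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

A 3072-dim Gemini vector and a 768-dim CLIP vector both come out of this as 768-dim unit vectors, so cosine similarity reduces to a dot product in either case.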
Cross-Modal Search
The whole point: you embed media through one path and query through another.
```python
engine = get_multimodal_engine()

# Store: embed an image
img_emb = await engine.embed_image("photo_of_sunset.jpg")
await nexus.store_multimodal(embedding=img_emb.vector, modality="image", ...)

# Query: search with text
text_emb = await engine.embed_text_aligned("warm sunset over the ocean")
results = await nexus.search_multimodal(query_embedding=text_emb.vector)
```
embed_text_aligned() is the key method. It embeds text through the same CLIP model as images, not through nomic-embed-text (which lives in a completely different embedding space). This ensures the query vector and the stored vectors are in the same space.
Regular text embeddings via EmbeddingEngine.embed() still use nomic-embed-text for standard RAG. The two systems are separate by design — you don't mix CLIP and nomic vectors in the same collection.
Prism Integration
Our training data service (AitherPrism) already processes video frames for classification. We added a single parameter:
```python
result = await prism.classify_frame(frame, embed_multimodal=True)
```
When enabled, Prism auto-embeds the raw frame image after classification. The multimodal hash and backend get stored in the frame's metadata. Every frame we classify becomes searchable by visual content — not just by the text labels we assigned.
Three new Prism endpoints expose the system:
- POST /embed/multimodal — Upload a file, get an embedding
- POST /search/multimodal — Cross-modal vector search
- GET /multimodal/status — Backend health, cache stats, call counts
The Cache Layer
Embedding the same image twice is wasteful. The engine maintains an LRU cache keyed by SHA256 hash of the media bytes:
- 2,000 entries, 2-hour TTL
- If the same image comes through again, return the cached vector instantly
- Hash is truncated to 32 chars for compactness
- Shared across all calls (the engine is a singleton)
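The cache those bullets describe can be sketched with an `OrderedDict`; class and method names here are illustrative, not the engine's actual implementation:

```python
import hashlib
import time
from collections import OrderedDict

class MediaEmbeddingCache:
    """LRU cache keyed by a truncated SHA-256 of the media bytes, with a TTL."""

    def __init__(self, max_entries: int = 2000, ttl_seconds: float = 2 * 3600):
        self.max_entries = max_entries
        self.ttl = ttl_seconds
        self._store = OrderedDict()  # key -> (timestamp, vector)

    @staticmethod
    def key(data: bytes) -> str:
        # Truncated to 32 chars for compactness, as in the engine.
        return hashlib.sha256(data).hexdigest()[:32]

    def get(self, data: bytes):
        k = self.key(data)
        entry = self._store.get(k)
        if entry is None:
            return None
        ts, vec = entry
        if time.monotonic() - ts > self.ttl:
            del self._store[k]  # expired
            return None
        self._store.move_to_end(k)  # mark as recently used
        return vec

    def put(self, data: bytes, vec):
        k = self.key(data)
        self._store[k] = (time.monotonic(), vec)
        self._store.move_to_end(k)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least-recently-used
```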
Why Local-First Matters
We could have just used Gemini Embedding 2. It's genuinely impressive — natively multimodal, with built-in audio and video support (not just images), at 3072 dimensions.
But:
- Every image goes to Google. Creative assets, proprietary designs, client photos — all leaving your machine.
- Latency. A local CLIP embed takes ~15ms. A round-trip to Gemini takes 200-800ms.
- Cost. Embedding at scale adds up fast. Local inference on hardware you already own is free.
- Availability. API keys expire. Services go down. Rate limits hit. Local models just work.
ComfyUI was already running. CLIP was already loaded. The GPU was already paid for. We just asked it to do one more thing.
Numbers
- 82 tests covering all 4 backends, fallback chain, caching, cross-modal alignment, VRAM safety, Canvas integration, MindClient, NexusClient, Prism
- 4 backends in the engine (comfyui, clip, gemini, text_fallback)
- 768 dimensions — consistent across every backend
- ~15ms local CLIP embedding latency (GPU)
- 0 new GPU models loaded when ComfyUI is already running
Files
| File | What it does |
|---|---|
| lib/core/MultimodalEmbeddingEngine.py | Core engine — 4 backends, cache, normalization |
| config/embeddings.yaml | Backend priority, model config, VRAM thresholds |
| services/perception/AitherCanvas.py | /embed/clip endpoint on Canvas service |
| lib/clients/canvas.py | embed_clip() client method |
| lib/agents/AitherMindClient.py | embed_media(), search_multimodal() |
| lib/clients/nexus.py | store_multimodal(), search_multimodal() |
| services/training/AitherPrism.py | Frame embedding, /embed/multimodal endpoints |
The embedding engine is a singleton. The Canvas CLIP model is a lazy-loaded singleton. The cache is shared. No redundant allocations, no VRAM waste.
If your system already has a GPU running image generation, you're sitting on a multimodal embedding engine. You just haven't turned it on yet.