Early Access Preview
6-Layer GPU Safety

Deploy any model.
Never crash a GPU.

One-click model deployment on cloud GPUs with context overflow protection, automatic failover, VRAM management, and OpenAI-compatible inference. Your model. Your GPU. No DevOps.

The 3-Line Experience

Same code as OpenAI. Your model, your GPU. Change one URL.

```python
from openai import OpenAI

client = OpenAI(base_url="https://gateway.aitherium.com/v1", api_key="aither_sk_...")
response = client.chat.completions.create(model="gemma-4-32b", messages=[...])
```

Works with any OpenAI-compatible client: LangChain, LlamaIndex, Cline, Continue, Cursor.

The Problem with Self-Hosted LLMs

Context Overflow

Send too many tokens and the GPU runs out of KV cache memory. Process crashes. Machine bricks.

No Recovery

GPU crashes require manual SSH, process kill, and re-provisioning. Nobody is watching at 3 AM.

No Degradation

When things go wrong, the entire system stops. No fallback, no slower-but-working alternative.

Tooling Gaps

IDE agents expect OpenAI endpoints. Raw vLLM speaks a slightly different dialect. Things break.

6 Layers of Defense

Every request passes through six independent safety layers. Any single layer can prevent a GPU crash on its own. Together, they make context overflow effectively impossible.

Layer 1: Hard Cap (Context Clamping)

max_tokens is clamped to 60% of the context window on every request. No exceptions, no bypass.

hard_cap = max(256, int(context_window * 0.60))

Layer 2: Input Truncation

Input tokens are estimated and compared against the available budget. Oversized system prompts are trimmed, preserving the user's actual message.

input_budget = context_window - max_output_tokens - 512
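A minimal sketch of the truncation step. The budget formula is from the line above; the 4-characters-per-token estimate and the trimming strategy are assumptions, not the real estimator:

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: ~4 characters per token (an assumption)."""
    return len(text) // 4

def fit_input(system_prompt: str, user_msg: str,
              context_window: int, max_output_tokens: int) -> tuple[str, str]:
    """Trim the system prompt until the input fits, keeping the user message intact."""
    input_budget = context_window - max_output_tokens - 512  # 512-token safety margin
    overflow = estimate_tokens(system_prompt) + estimate_tokens(user_msg) - input_budget
    if overflow > 0:
        # Trim only the system prompt; the user's actual message is preserved.
        keep_chars = max(0, (estimate_tokens(system_prompt) - overflow) * 4)
        system_prompt = system_prompt[:keep_chars]
    return system_prompt, user_msg
```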

Layer 3: Extractive Compression

Instead of cutting text at a boundary, lines are scored by importance and the highest-scoring content is kept. No LLM call, pure heuristic, runs in microseconds.

Headers +5.0, code sigs +3.0, key-values +2.0, query overlap +2.0/term
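The scoring heuristic might look like this. The weights come from the line above; the regex patterns and the selection function are assumptions:

```python
import re

def score_line(line: str, query_terms: set[str]) -> float:
    """Score one line by the stated weights (regex patterns are assumptions)."""
    score = 0.0
    if re.match(r"#{1,6}\s", line):                     # header
        score += 5.0
    if re.match(r"\s*(def|class|function)\b", line):    # code signature
        score += 3.0
    if re.match(r"\s*[\w.-]+\s*[:=]\s*\S", line):       # key-value pair
        score += 2.0
    words = set(re.findall(r"\w+", line.lower()))
    score += 2.0 * len(words & query_terms)             # query overlap: +2 per term
    return score

def compress(text: str, query: str, keep: int) -> str:
    """Keep the `keep` highest-scoring lines, preserving original order."""
    terms = set(query.lower().split())
    lines = text.splitlines()
    ranked = sorted(range(len(lines)), key=lambda i: score_line(lines[i], terms), reverse=True)
    return "\n".join(lines[i] for i in sorted(ranked[:keep]))
```

No model call anywhere in the path, which is what keeps this layer in the microsecond range.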

Layer 4: VRAM Placement Manager

Continuously monitors GPU memory. When VRAM drops below threshold, low-priority models offload to cloud. When VRAM recovers, they come back. 15-second evaluation cycle.

P0: orchestrator (never offload) > P1: vision > P2: coding > P3: reasoning
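One way the placement decision could work, as a sketch: the priority order comes from the text, while the thresholds, per-model size estimate, and function shape are assumptions:

```python
PRIORITY = {"orchestrator": 0, "vision": 1, "coding": 2, "reasoning": 3}  # P0 never offloads

def plan_offload(free_vram_gb: float, local_models: list[str],
                 min_free_gb: float = 4.0, est_model_gb: float = 8.0) -> list[str]:
    """Pick models to push to cloud, lowest-priority first, until headroom is back."""
    to_offload = []
    freed = free_vram_gb
    for model in sorted(local_models, key=PRIORITY.__getitem__, reverse=True):
        if freed >= min_free_gb or PRIORITY[model] == 0:
            break  # enough headroom, or only the orchestrator remains
        to_offload.append(model)
        freed += est_model_gb  # assume offloading frees roughly this much
    return to_offload
```

Run every 15 seconds, the same function also answers the reverse question: when `free_vram_gb` recovers, nothing is selected and offloaded models can return.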

Layer 5: Backend Failover Chain

Every backend has a defined fallback. If vLLM crashes, requests route to cloud. If cloud fails, Ollama on CPU. The chain always terminates.

vllm_swap > vllm_cloud > vllm > ollama (CPU, never OOMs)
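The chain can be modeled as a simple lookup walk. Backend names mirror the text; the `backends` map of name-to-callable is an assumption about the interface:

```python
FALLBACK = {"vllm_swap": "vllm_cloud", "vllm_cloud": "vllm", "vllm": "ollama", "ollama": None}

def route(request, backends, start="vllm_swap"):
    """Walk the fallback chain until a backend answers; ollama (CPU) is the terminus."""
    backend = start
    while backend is not None:
        try:
            return backends[backend](request)
        except Exception:
            backend = FALLBACK[backend]  # this backend is down; try the next one
    raise RuntimeError("all backends failed")  # unreachable if ollama never OOMs
```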

Layer 6: GPU Cluster Auto-Recovery

Health checks detect GPU failures within 30 seconds. A replacement is provisioned automatically. No SSH, no manual intervention.

Local (prio 1) > Remote mesh (prio 2) > Cloud on-demand (prio 3)
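A sketch of the provisioning fallback: the source names and their order come from the text, while the callable interface is assumed:

```python
SOURCES = ["local", "remote_mesh", "cloud_on_demand"]  # priority 1, 2, 3

def provision_replacement(provisioners: dict) -> str:
    """Try each GPU source in priority order; return the first handle that comes up."""
    for source in SOURCES:
        try:
            return provisioners[source]()
        except Exception:
            continue  # no capacity at this tier; fall through to the next
    raise RuntimeError("no GPU capacity available anywhere")
```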

TurboQuant KV Cache: 3.8x More Context

Standard FP16 KV cache uses 512 bytes per token per layer. TurboQuant compresses that to roughly 135 bytes using 4-bit vector quantization with fused Triton kernels.

32K: Standard context capacity
~122K: With TurboQuant (same VRAM)

Less than 0.3% perplexity increase vs FP16. Undetectable in real conversations. Published as aither-kvcache on PyPI.
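A back-of-envelope check of those figures; the per-token-per-layer size after quantization is derived here from the stated 3.8x ratio, not measured from the library itself:

```python
fp16_bytes = 512                 # per token per layer, FP16 (from the text)
quant_bytes = fp16_bytes / 3.8   # 4-bit values are 4x smaller; overhead lands the ratio at ~3.8x
print(round(quant_bytes))        # 135
print(round(32_000 * 3.8))       # 121600 -- i.e. ~122K tokens in the same VRAM
```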

Under the Hood

Multi-priority queue with per-backend concurrency limits. User requests preempt agent requests, which preempt background tasks. No thundering herd.

Orchestrator (8B): 24 concurrent (continuous batching)
Reasoning (14B): 8 concurrent
Vision: 4 concurrent
Coding: 6 concurrent
Ollama (CPU): 8 concurrent
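That admission scheme can be sketched with a priority queue plus one semaphore per backend. The concurrency limits come from the list above; the tier values and overall shape are assumptions, and strict priority dequeue stands in for preemption here:

```python
import asyncio
import itertools

LIMITS = {"orchestrator": 24, "reasoning": 8, "vision": 4, "coding": 6, "ollama": 8}
USER, AGENT, BACKGROUND = 0, 1, 2  # lower value = dequeued first

async def drain(jobs, workers=4):
    """Serve jobs in priority order while capping in-flight requests per backend."""
    sems = {b: asyncio.Semaphore(n) for b, n in LIMITS.items()}
    q = asyncio.PriorityQueue()
    tiebreak = itertools.count()  # FIFO within a tier; avoids comparing job callables
    for priority, backend, job in jobs:
        q.put_nowait((priority, next(tiebreak), backend, job))
    results = []

    async def worker():
        while not q.empty():
            _, _, backend, job = q.get_nowait()
            async with sems[backend]:  # per-backend concurrency cap
                results.append(await job())

    await asyncio.gather(*(worker() for _ in range(workers)))
    return results
```

With a single worker, a queued user request is served before an agent request, which is served before a background task, regardless of arrival order.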

The Numbers

6: Independent safety layers
3.8x: Context capacity (TurboQuant)
<30s: Crash detection to recovery

Deploy Any HuggingFace Model

Point at any HuggingFace model ID. We handle VRAM sizing, vLLM configuration, GPU provisioning, health monitoring, and the OpenAI-compatible endpoint.

Gemma 4 32B: 128K context, 24 GB
Llama 3.3 70B: 128K context, 48 GB
Qwen 3 32B: 128K context, 24 GB
DeepSeek R1 70B: 64K context, 48 GB
Mistral Large 2: 128K context, 80 GB
Any HF model: Model default context, auto-sized

How It Works

Step 1: Pick a model

Choose from the catalog or paste any HuggingFace model ID. The system calculates VRAM requirements and finds the cheapest GPU that fits.

Step 2: Deploy

One click provisions the GPU, installs vLLM, loads the model, configures TurboQuant, and registers the OpenAI-compatible endpoint. SSE progress streaming shows every step.

Step 3: Use it

Point your OpenAI client at the gateway URL. All 6 safety layers are active. Monitor usage, costs, and health from the dashboard.

Technical Deep Dive

Read the full engineering breakdown of every safety layer, with code excerpts from the actual production system.

Read the blog post