Early Access Preview—AitherOS is in active development. Features may change, break, or disappear.

managed gpu inference

Deploy any model.
Never crash a GPU.

Point a HuggingFace model at a cloud GPU and get an OpenAI-compatible endpoint that physically cannot be overflowed. Six independent clamps sit between every request and the KV cache. No DevOps, no 3am pager.

same code as OpenAI — one URL

python

from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.aitherium.com/v1",
    api_key="aither_sk_...",
)
r = client.chat.completions.create(
    model="gemma-4-32b",
    messages=[...],
)

drop-in for LangChain · LlamaIndex · Cline · Continue · Cursor

Safety clamps: 6
Context / VRAM: 3.8×
Crash → recovery: <30s
Overflow crashes: 0

01 Why self-hosted LLMs die

Context overflow

One oversized request exhausts the KV cache. The process dies mid-generation and takes the GPU with it.

Nobody is watching

A dead worker needs SSH, a process kill, and re-provisioning — at whatever hour it happened to fall over.

No graceful degrade

When one thing breaks, everything stops. No fallback, no slower-but-alive path to keep serving.

Dialect drift

IDE agents speak OpenAI. Raw vLLM speaks almost-OpenAI. The gap surfaces as intermittent, hard-to-trace failures.

02 The gauntlet · six clamps

Every request passes through six independent layers before it reaches the GPU. Any one of them alone prevents a crash; stacked, they make context overflow arithmetically impossible: L1 caps output at max_out, L2 caps input at ctx − max_out − 512, so input plus output never exceeds ctx − 512 tokens.

L1 Hard cap
Catches: runaway generation
Output tokens are clamped to 60% of the context window, with a floor of 256 tokens, on every request. No flag disables it.
hard_cap = max(256, ctx * 0.60)
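
A minimal sketch of the hard cap, assuming the cap is truncated to whole tokens and that a missing or invalid request limit falls back to the cap. The function names are illustrative and not taken from the production system.

```python
from typing import Optional

FLOOR_TOKENS = 256   # floor from the formula above
CAP_FRACTION = 0.60  # share of the context window allowed for output


def hard_cap(ctx: int) -> int:
    """hard_cap = max(256, ctx * 0.60), truncated to whole tokens."""
    return max(FLOOR_TOKENS, int(ctx * CAP_FRACTION))


def clamp_max_tokens(requested: Optional[int], ctx: int) -> int:
    """Return the output-token limit that would be sent to the backend."""
    cap = hard_cap(ctx)
    if requested is None or requested <= 0:
        # Assumption: a missing or invalid limit falls back to the cap.
        return cap
    return min(requested, cap)
```

With a 32,768-token window, a request asking for 100,000 output tokens would be served with a limit of 19,660 under these assumptions.
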
L2 Input truncation
Catches: prompt bloat
Input is measured against the remaining budget; oversized system prompts are trimmed while the user message is preserved intact.
budget = ctx − max_out − 512
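
The budget formula can be sketched as follows. The whitespace token counter is a stand-in for a real tokenizer, and treating the 512 as safety headroom is an assumption; all names here are hypothetical.

```python
from typing import Tuple

SAFETY_MARGIN = 512  # constant from the formula above; assumed to be headroom


def count_tokens(text: str) -> int:
    """Placeholder tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def input_budget(ctx: int, max_out: int) -> int:
    """budget = ctx - max_out - 512"""
    return ctx - max_out - SAFETY_MARGIN


def fit_messages(system: str, user: str, ctx: int, max_out: int) -> Tuple[str, str]:
    """Trim the system prompt to fit the budget; never touch the user message."""
    budget = input_budget(ctx, max_out)
    # Whatever the user message does not consume is left for the system prompt.
    room_for_system = max(0, budget - count_tokens(user))
    words = system.split()
    if len(words) > room_for_system:
        system = " ".join(words[:room_for_system])
    return system, user
```

In this sketch the user message always survives intact; if it alone exceeds the budget, the system prompt is dropped entirely rather than cutting the user text.
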
L3 Extractive compression
Catches: long context
Instead of a blunt cut, lines are scored by importance and the densest content is kept. It is a pure heuristic that runs in microseconds with no model call.
headers +5 · sigs +3 · kv +2 · overlap +2/term
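
One way to read the scoring line above is sketched below. It assumes "headers" means Markdown-style headings, "sigs" means function or class signatures, "kv" means key-value lines, and "overlap" counts query terms that appear in the line. The regular expressions and function names are illustrative guesses, not the production rules.

```python
import re
from typing import Set

HEADER = re.compile(r"^\s*#{1,6}\s+\S")                      # e.g. "# Config"
SIGNATURE = re.compile(r"^\s*(def|class|fn|func|function)\s+\w+")
KEY_VALUE = re.compile(r"^\s*[\w.-]+\s*[:=]\s*\S")           # e.g. "timeout: 30"


def score_line(line: str, query_terms: Set[str]) -> int:
    """Apply the weights: header +5, signature +3, key-value +2, +2 per shared term."""
    score = 0
    if HEADER.match(line):
        score += 5
    if SIGNATURE.match(line):
        score += 3
    if KEY_VALUE.match(line):
        score += 2
    words = set(re.findall(r"\w+", line.lower()))
    score += 2 * len(words & query_terms)
    return score


def compress(text: str, query: str, max_lines: int) -> str:
    """Keep the max_lines highest-scoring lines, in their original order."""
    terms = set(re.findall(r"\w+", query.lower()))
    lines = text.splitlines()
    ranked = sorted(range(len(lines)),
                    key=lambda i: score_line(lines[i], terms),
                    reverse=True)
    keep = sorted(ranked[:max_lines])
    return "\n".join(lines[i] for i in keep)
```

Restoring the original line order after ranking keeps the compressed text readable, which matters more for a prompt than strict score order.
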
L4 VRAM placement
Catches: memory pressure
A 15-second loop watches GPU memory. When free VRAM falls below the threshold, low-priority models offload to cloud; when it recovers, they return.
P0 orchestrator ▸ P1 vision ▸ P2 coding ▸ P3 reasoning
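
A sketch of what one tick of such a loop might decide, assuming a single free-memory threshold and assuming the P0 model is never offloaded. The `Model` class, `rebalance` function, and sizes are hypothetical; a real loop would call this every 15 seconds with a live memory reading.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Model:
    name: str
    priority: int          # 0 = P0 (highest) ... 3 = P3 (lowest)
    vram_gb: float
    on_gpu: bool = True


def rebalance(models: List[Model], free_gb: float, min_free_gb: float) -> List[Tuple[str, str]]:
    """Decide offloads or restores for one tick; returns (action, model name) pairs."""
    actions = []
    if free_gb < min_free_gb:
        # Under pressure: offload lowest priority first until the threshold is met.
        for m in sorted(models, key=lambda m: m.priority, reverse=True):
            if free_gb >= min_free_gb:
                break
            if m.on_gpu and m.priority > 0:  # assumption: P0 always stays local
                m.on_gpu = False
                free_gb += m.vram_gb
                actions.append(("offload", m.name))
    else:
        # Recovered: bring models back, highest priority first, if they still fit.
        for m in sorted(models, key=lambda m: m.priority):
            if not m.on_gpu and free_gb - m.vram_gb >= min_free_gb:
                m.on_gpu = True
                free_gb -= m.vram_gb
                actions.append(("restore", m.name))
    return actions
```

Requiring a restored model to leave the threshold intact avoids flapping, where a model returns only to be offloaded on the next tick.
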
L5 Backend failover
Catches: backend crash
Every backend names its fallback. vLLM down → cloud. Cloud down → Ollama on CPU, which cannot run out of VRAM. The chain always terminates.
vllm_swap ▸ vllm_cloud ▸ vllm ▸ ollama
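
The chain above can be modelled as a table in which each backend names its fallback. This is a sketch under assumptions: the health check is a caller-supplied function, and the `route` helper is hypothetical rather than the gateway's actual router.

```python
from typing import Callable, Dict, Optional

# Each backend names its fallback; the last entry has none.
FALLBACK: Dict[str, Optional[str]] = {
    "vllm_swap": "vllm_cloud",
    "vllm_cloud": "vllm",
    "vllm": "ollama",
    "ollama": None,
}


def route(start: str, is_healthy: Callable[[str], bool]) -> str:
    """Walk the fallback chain from start and return the first healthy backend."""
    seen = set()
    backend: Optional[str] = start
    # The seen-set guarantees termination even if the table were misconfigured
    # with a cycle.
    while backend is not None and backend not in seen:
        if is_healthy(backend):
            return backend
        seen.add(backend)
        backend = FALLBACK.get(backend)
    raise RuntimeError("no healthy backend in the fallback chain")
```

Because the walk visits each backend at most once, it finishes after at most four health checks with this table.
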
L6 Cluster auto-recovery
Catches: hardware death
Health checks catch a dead GPU within 30 seconds and provision a replacement. No SSH, no pager at 3am.
local ▸ remote mesh ▸ cloud on-demand
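
A rough sketch of the detection and replacement decision, assuming nodes report heartbeats and that a node counts as dead once its last heartbeat is more than 30 seconds old. The heartbeat mechanism, node names, and both function names are assumptions made for illustration.

```python
from typing import Dict, List, Optional

DEAD_AFTER_S = 30.0  # detection window from the text above
TIERS = ["local", "remote mesh", "cloud on-demand"]  # replacement order


def dead_nodes(last_seen: Dict[str, float], now: float) -> List[str]:
    """Return nodes whose last heartbeat is older than the detection window."""
    return sorted(n for n, t in last_seen.items() if now - t > DEAD_AFTER_S)


def pick_replacement_tier(available: Dict[str, int]) -> Optional[str]:
    """Pick the first tier, in order, that has spare capacity."""
    for tier in TIERS:
        if available.get(tier, 0) > 0:
            return tier
    return None
```
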

03 TurboQuant KV cache

3.8× the context, same VRAM.

FP16 KV cache burns 512 bytes per token per layer. TurboQuant packs it with 4-bit vector quantization and fused Triton kernels — under 0.3% perplexity drift, undetectable in real conversation.

shipped as aither-kvcache on PyPI

Tokens of context on one 24GB GPU:
FP16 baseline: 32K
TurboQuant: 122K
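
The figures above can be checked with a little arithmetic. This sketch treats "32K" as 32,000 tokens for round numbers, which is an assumption. Attributing the gap between the ideal 4× and the quoted 3.8× to quantization metadata such as scales or codebooks is also an assumption, not something the page states.

```python
FP16_BYTES = 512          # bytes per token per layer, from the text above
CONTEXT_GAIN = 3.8        # quoted context multiplier
BASELINE_TOKENS = 32_000  # "32K", read as 32,000 (assumption)


def raw_4bit_bytes(fp16_bytes: float) -> float:
    """Ideal size if every 16-bit value shrank to exactly 4 bits."""
    return fp16_bytes * 4 / 16


def effective_bytes(fp16_bytes: float, gain: float) -> float:
    """Per-token, per-layer size implied by the quoted context gain."""
    return fp16_bytes / gain


def extended_context(baseline_tokens: int, gain: float) -> int:
    """Tokens that fit in the same VRAM after compression."""
    return round(baseline_tokens * gain)


ideal = raw_4bit_bytes(FP16_BYTES)                    # 128 bytes
implied = effective_bytes(FP16_BYTES, CONTEXT_GAIN)   # about 134.7 bytes
overhead = implied - ideal                            # about 6.7 bytes per token per layer
tokens = extended_context(BASELINE_TOKENS, CONTEXT_GAIN)  # 121,600, i.e. roughly 122K
```

The quoted 122K is therefore consistent with 32K × 3.8, and the implied overhead is about 5% on top of the ideal 4-bit size.
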

04 Deploy any HuggingFace model

Paste a model ID. VRAM sizing, vLLM config, GPU provisioning, health monitoring, and the OpenAI endpoint are handled for you.

Model             Context         VRAM
Gemma 4 32B       128K            24 GB
Llama 3.3 70B     128K            48 GB
Qwen 3 32B        128K            24 GB
DeepSeek R1 70B   64K             48 GB
Mistral Large 2   128K            80 GB
Any HF model      model default   auto-sized

05 Pick · deploy · use

1 / 3

Pick a model

Catalog or any HF ID. We compute VRAM and find the cheapest GPU across Vast, RunPod, and Lambda that fits.

2 / 3

Deploy

One click provisions the GPU, installs vLLM, loads the model, wires TurboQuant, and streams every step over SSE.

3 / 3

Use it

Point your OpenAI client at the gateway. All six clamps are live. Watch usage, cost, and health from the dashboard.
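
The GPU selection in step 1 can be sketched as a filter followed by a minimum. The `Offer` type and `cheapest_fit` function are hypothetical, and any offers passed in would come from the providers' own listings; nothing here reflects real prices or the production matcher.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Offer:
    provider: str        # e.g. "Vast", "RunPod", "Lambda"
    gpu: str
    vram_gb: int
    usd_per_hour: float


def cheapest_fit(offers: List[Offer], required_gb: float) -> Optional[Offer]:
    """Return the lowest-priced offer with enough VRAM, or None if nothing fits."""
    fitting = [o for o in offers if o.vram_gb >= required_gb]
    return min(fitting, key=lambda o: o.usd_per_hour, default=None)
```

Filtering before taking the minimum means a cheap but undersized GPU can never win, which is the property the step describes.
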

Deploy a model

Provision a GPU and get an endpoint in minutes.

Read the deep dive

Every clamp, with code from the production system.
