
Real-Time Token Economy: See Your Impact on the Grid

March 12, 2026 · 12 min read · Aitherium

Everything on Aitherium is free right now.

Zero tokens. Every chat, every image generation, every agent task — free. We're in alpha. The GPU pool is running, the models are loaded, and nobody pays anything.

But "free" doesn't mean "no cost." Somebody is paying for the electricity, the VRAM, the inference time. Right now that's us. And we wanted people to see that — not as some abstract billing page, but as a live, real-time system they can watch and feel.

So we built a token economy tracker that polls the actual GPU pool, shows the real queue depth, computes a live cost multiplier using an S-curve model, and — when the system hits capacity — shows exactly what it would take to spin up another instance.

This is the story of how it works.

The Problem With Static Pricing Pages

Most platforms show you a pricing table. Flat rates. Maybe tiers. The price is the price.

That model breaks down the moment you're running your own GPU pool. The real cost of inference is:

  • variable — it depends on how many requests are in flight
  • competitive — VRAM is finite, and two 70B model loads can't share the same 24GB card
  • spiky — one user running a swarm of 11 coding agents is very different from one user asking "what's the weather"

Static pricing either overcharges during quiet periods or undercharges during peaks. Neither feels honest.

We wanted something different: a cost model that is visually transparent, that updates in real time, and that makes the economics of running inference visible to the people using it.

Seven Services, One API Route

The token economy tracker is a single Next.js API route at /api/system/token-economy. When it runs, it fans out to seven backend services simultaneously:

┌──────────────────────────────────────────────────────┐
│                   Token Economy API                   │
│                 (Next.js route.ts)                    │
├──────────────────────────────────────────────────────┤
│                                                       │
│  Promise.all([                                        │
│    Genesis:8001          → LLM queue + agents + slots │
│    MicroScheduler:8150   → capacity + models + queue  │
│    AitherCompute:8168    → GPU devices + VRAM + cloud │
│    AitherAccel:8103      → GPU acceleration + metrics │
│    AitherParallel:8100   → parallel stream stats      │
│    StrataData:8136/mint  → token supply + pool rates  │
│    AitherACTA:8206       → compute pressure + limits  │
│  ])                                                   │
│                                                       │
│  → S-curve pricing → capacity model → response        │
└──────────────────────────────────────────────────────┘

Each service call is individually wrapped in a try-catch with a 2.5-second AbortController timeout. The route prefers Genesis for LLM queue data but falls back to MicroScheduler if Genesis is down. GPU metrics prefer Accel's real-time data with Compute as a fallback. If Mint (running inside the StrataData compound on port 8136) is unreachable, we fall back to default token rates. The response includes a services_online health map — seven booleans, one per backend — so the frontend knows exactly which data sources are live.

Fifteen endpoints. Seven services. One parallel fan-out. The whole thing resolves in under 500ms on a healthy stack.
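The pattern is simple enough to sketch. Assuming Node's built-in fetch and AbortController, something like this (helper names and URLs are illustrative, not the actual route code):

```typescript
// Sketch: parallel fan-out with per-call timeouts. A down or slow
// service resolves to null instead of throwing, so one failure
// never blocks the others.
async function fetchJson(url: string, timeoutMs = 2500): Promise<unknown | null> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    return res.ok ? await res.json() : null;
  } catch {
    return null; // timeout, network error, or bad response → offline
  } finally {
    clearTimeout(timer);
  }
}

// One entry per backend; a null result marks that service offline
// in the services_online health map.
async function pollServices(urls: Record<string, string>) {
  const names = Object.keys(urls);
  const results = await Promise.all(names.map((n) => fetchJson(urls[n])));
  const data: Record<string, unknown | null> = {};
  const services_online: Record<string, boolean> = {};
  names.forEach((n, i) => {
    data[n] = results[i];
    services_online[n] = results[i] !== null;
  });
  return { data, services_online };
}
```

Because the calls run under `Promise.all` with independent timeouts, the whole fan-out takes as long as the slowest healthy service, capped at 2.5 seconds.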

The S-Curve Pricing Model

Flat pricing is a lie. Linear pricing is brutal. We use a sigmoid (S-curve) function:

multiplier = floor + (ceiling − floor) / (1 + e^(−steepness · (x − midpoint)))

With our current parameters:

  • Floor: 1.0× (minimum cost, the system is idle)
  • Ceiling: 6.0× (maximum cost, the system is saturated)
  • Steepness: 8.0 (how sharp the transition is)
  • Midpoint: 0.55 (55% utilization is the inflection point)

What this means in practice:

| GPU Utilization | Cost Multiplier | Feel |
| --- | --- | --- |
| 0–30% | ~1.0× | Cheap. Plenty of capacity. |
| 30–50% | 1.0–1.5× | Starting to climb. |
| 50–60% | 1.5–3.5× | Steep increase. The curve is biting. |
| 60–80% | 3.5–5.5× | Expensive. Queue is real. |
| 80–100% | 5.5–6.0× | Ceiling. Maximum rate. |

The S-curve is important because it models economic pressure correctly. When the system is mostly idle, adding one more request costs almost nothing. When it's 80% loaded, every additional request fights for scarce VRAM and queues behind existing jobs. The curve captures that nonlinearity naturally.
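The curve itself is only a few lines of code. A minimal sketch using the parameters above (the function name is ours, not the route's):

```typescript
// S-curve cost multiplier with the parameters from the post.
// Note: a sigmoid only approaches its floor and ceiling asymptotically,
// so at 0% utilization this returns ~1.06×, not exactly 1.0×.
const FLOOR = 1.0;
const CEILING = 6.0;
const STEEPNESS = 8.0;
const MIDPOINT = 0.55;

function costMultiplier(utilization: number): number {
  const x = Math.min(Math.max(utilization, 0), 1); // clamp to [0, 1]
  return FLOOR + (CEILING - FLOOR) / (1 + Math.exp(-STEEPNESS * (x - MIDPOINT)));
}
```

At the 0.55 midpoint this returns exactly 3.5×, the halfway point between floor and ceiling, which is where the curve climbs fastest.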

The variable x isn't just "GPU percentage." It's a weighted composite:

const gpuPressure = vramUsed / vramTotal;     // how full the cards are
const queuePressure = queueDepth / maxSlots;  // how backed up the queue is
const utilization = Math.max(gpuPressure, queuePressure * 0.9);

When ACTA is online, the route prefers its compute_pressure endpoint — ACTA factors in per-user rate limits and resource-level pressure data that the local calculation can't see. The local S-curve acts as a fallback when ACTA is unreachable.

Queue depth matters almost as much as raw GPU utilization. A system with 40% VRAM used but 15 requests queued is under more pressure than a system with 60% VRAM and zero queue.

What the User Sees

The token economy surfaces in three places:

1. The Economy Banner

Every page that involves Aitherium tokens — status, compute, chat — gets a banner that operates in three modes:

  • 🟢 Normal (green): "Alpha — Everything is Free." The system is healthy. An expandable panel shows the live tracker if you're curious.
  • 🟡 Pressure (amber): "High Demand — Costs Are Elevated." Appears when capacity exceeds 80% or when back-pressure is detected. Shows the current multiplier.
  • 🔴 Saturated (red): "Compute is Full." The queue is maxed. The banner transforms into a capacity status panel with a progress bar showing how close we are to needing another instance.

The mode transitions are automatic. No manual intervention. The system's state drives the UI.

2. The Compact Tracker

In the system activity panel, a single-line compact tracker shows: GPU utilization percentage, VRAM usage, queue depth, current multiplier, base cost, active cloud instances, and a FULL indicator when saturated. It's designed to be glanceable — a quick health check without expanding anything.

3. The Full Tracker

Click to expand and you get the full picture:

  • GPU Pool: A VRAM bar per physical GPU device, showing used vs. total with per-device utilization percentages. If you have four GPUs, you see four bars.
  • LLM Slots: A segmented bar showing active inference slots (green), queued requests (amber), and available capacity (gray). Plus chips for each loaded model.
  • Metric Grid: Four cells — queue depth, cost multiplier, base cost per request, active cloud instances.
  • Capacity Panel: When the system needs more capacity, this shows the resource threshold, a live progress bar toward auto-scaling, and the auto-scale-down timer.
  • Cloud Instances: A list of active cloud instances with their GPU type, region, and status.

Demand-Driven Capacity Scaling

This is the part we're most excited about.

When the local GPU pool is saturated — VRAM full, queue maxed, multiplier at ceiling — the system doesn't just throw a "try again later" message. Instead, it calculates exactly what it would cost to deploy an additional cloud GPU instance and shows the user that information.

The math:

1 USD ≈ 5,000 Aitherium tokens
Cloud GPU ≈ $0.80/hour ≈ 4,000 tokens/hour

The capacity panel shows:

  • Tokens equivalent: The compute cost of the next cloud GPU instance, expressed in tokens
  • Progress: How close current demand is to triggering an auto-scale event
  • Auto-scale-down: When demand drops, the extra capacity automatically winds down — no wasted compute

The plumbing is real. The API route calculates capacity_scaling fields from Mint's token pool data (supply, circulation) and Compute's cloud provisioning data (available GPU types via Vast.ai, current instances, hourly cost). Right now everything_free is hardcoded true in the route — which zeros out effective costs — but the calculation runs on every poll so we can flip the switch without redeploying anything.
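The zeroing is trivial to express. A sketch of how it might look (field and constant names are ours; only the rates come from the math above):

```typescript
// Conversion rates from the post: 1 USD ≈ 5,000 tokens; cloud GPU ≈ $0.80/hr.
const TOKENS_PER_USD = 5000;
const CLOUD_GPU_USD_PER_HOUR = 0.8;

function capacityScaling(everythingFree: boolean, hours = 1) {
  // 0.80 * 5000 = 4,000 tokens per GPU-hour
  const tokensEquivalent = CLOUD_GPU_USD_PER_HOUR * TOKENS_PER_USD * hours;
  return {
    tokens_equivalent: tokensEquivalent,                    // always computed
    effective_cost: everythingFree ? 0 : tokensEquivalent,  // zeroed in alpha
  };
}
```

The real figure is computed on every poll; only the effective cost shown to users is zeroed, which is what makes the flip-the-switch launch possible.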

The key insight is that the users generating the most demand see exactly what that demand costs in real compute. Full transparency into the relationship between usage and infrastructure load.

The Six Demand Tiers

The tracker classifies system state into six tiers, each with its own color language:

| Tier | Color | Condition | Message |
| --- | --- | --- | --- |
| idle | Emerald | <10% utilization | "System idle" |
| low | Green | 10–40% | "Low demand" |
| moderate | Blue | 40–60% | "Moderate demand" |
| high | Amber | 60–80% | "High demand — costs rising" |
| surge | Orange | 80–95% | "Surge — near capacity" |
| saturated | Red | >95% or backpressure | "FULL — capacity limit reached" |

The tiers drive not just color but behavior. The banner switches modes. The compact tracker shows different indicators. The full tracker reveals or hides the capacity panel. Everything is state-driven from a single utilization number.
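The classification itself is a threshold ladder over that one number. A sketch with the thresholds from the table (the function name is ours):

```typescript
type Tier = "idle" | "low" | "moderate" | "high" | "surge" | "saturated";

// Classify system state from a 0–1 utilization value.
// Backpressure forces "saturated" regardless of utilization.
function demandTier(utilization: number, backpressure = false): Tier {
  if (backpressure || utilization > 0.95) return "saturated";
  if (utilization >= 0.8) return "surge";
  if (utilization >= 0.6) return "high";
  if (utilization >= 0.4) return "moderate";
  if (utilization >= 0.1) return "low";
  return "idle";
}
```

Banner mode, tracker indicators, and the capacity panel can all branch on the returned tier, so every surface stays consistent with the same state.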

Graceful Degradation

Not every service will be up all the time. The API route handles this by:

  1. Individual timeouts: Each of the fifteen fetch calls gets its own 2.5-second AbortController. One slow service doesn't block the others.
  2. Fallback chains: LLM queue data prefers Genesis, falls back to MicroScheduler. GPU metrics prefer Accel, fall back to Compute. The S-curve multiplier prefers ACTA's compute_pressure, falls back to local calculation. You get less detail, not an error page.
  3. Health map: The response includes services_online: { genesis: true, microscheduler: true, compute: true, accel: true, parallel: true, mint: true, acta: true } so the frontend knows which panels to dim or hide.
  4. Safe defaults: Every metric has a zero-state default. The hook initializes with a DEFAULT_ECONOMY object that renders a valid, boring "everything is idle" view.
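The fallback chains and safe defaults compose naturally: each preferred source degrades to the next, and the chain ends in a valid idle state. A minimal sketch (types and names are illustrative):

```typescript
interface QueueStats {
  depth: number;
  activeSlots: number;
}

// Safe zero-state default: renders a valid, boring "everything is idle" view.
const DEFAULT_QUEUE: QueueStats = { depth: 0, activeSlots: 0 };

// Prefer Genesis, fall back to MicroScheduler, else the safe default.
function resolveQueue(
  genesis: QueueStats | null,
  scheduler: QueueStats | null,
): QueueStats {
  return genesis ?? scheduler ?? DEFAULT_QUEUE;
}
```

Every metric in the response goes through a chain like this, which is why a backend hiccup produces a dimmed panel rather than an error page.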

This matters because the token economy is shown on high-traffic pages (status, chat). A backend hiccup shouldn't crash the status page.

Polling Strategy

The useTokenEconomy hook polls every 4 seconds by default. The economy banner component overrides this to 5 seconds, since it appears on every page and doesn't need sub-second freshness. Polling that often might sound aggressive, but:

  • The payload is small (~2KB JSON)
  • The API route resolves in ~200–500ms (parallel fan-out, not sequential)
  • The UI benefits from near-real-time updates (watching your GPU bar climb as you queue a batch job is satisfying)
  • We use setInterval with cleanup, not WebSockets, because the data is aggregate telemetry, not per-user state
  • On fetch failure, the hook silently keeps the last good data — no flicker, no error states

Future improvement: switch to Server-Sent Events for push-based updates when the stack supports it. For now, 4–5 second polling strikes the right balance between freshness and backend load.
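The keep-last-good behavior can be sketched without React. A hook-free version (names are ours, not the actual useTokenEconomy implementation):

```typescript
// Poller sketch: on fetch failure, silently keep the last good data.
// No flicker, no error states — the UI just goes slightly stale.
function createPoller<T>(
  fetcher: () => Promise<T>,
  initial: T,
  intervalMs = 4000,
) {
  let data = initial; // starts from a safe default, like DEFAULT_ECONOMY
  const tick = async () => {
    try {
      data = await fetcher(); // replace only on success
    } catch {
      /* swallow the error and keep the previous snapshot */
    }
  };
  const timer = setInterval(tick, intervalMs);
  return {
    get data() { return data; },
    tick,                          // exposed for manual refresh and testing
    stop: () => clearInterval(timer),
  };
}
```

The React version wraps the same logic in useEffect with cleanup; the essential property is that a failed poll never replaces good data with an error state.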

Why This Matters

Most AI platforms are black boxes. You send a request, you get a response, you get a bill. You have no idea what happened in between.

We think transparency is a feature. If the GPU pool is 90% loaded, you should know that. If your request is queued behind 14 others, you should see that. If the system is close to needing another GPU to clear the backlog, you should see exactly how much compute that would take and when extra capacity will spin down.

The token economy tracker isn't just a monitoring tool. It's a statement about how we think AI infrastructure should work: visibly, openly, and with full awareness of what every request actually costs in real compute.

Right now everything is free. The everything_free flag is hardcoded true. The multiplier is 1.0×. The GPU bars are calm. The queue is empty. We're running a single local GPU — one RTX 4090 with 24GB of VRAM — and that's enough for the alpha.

But the tracker isn't mocked. The seven services are real. The fifteen endpoints return real data. The S-curve computes a real multiplier from real utilization (the cost it feeds into is just zeroed out). The Saga page's LIVE INFRASTRUCTURE panel polls the Saga backend every five seconds and shows actual GPU node count, actual VRAM, and actual queue depth. The Vast.ai provisioning client exists and is tested. The Mint pool balance is tracked.

When we flip everything_free to false, the economy goes live without a redeploy. When the queue fills up and VRAM gets scarce, the tracker will show exactly what's happening and exactly what it takes to scale — because it's already showing it, just with all the costs zeroed out.

That's the point. Not hiding the cost. Building the system that shows it, right from the start.


The token economy tracker is live in alpha at aitherium.com. Everything is currently free — one local GPU, seven backend services polled in real time, all costs zeroed. The infrastructure is real. The tracking is live.
