No More Guessing: How llmfit Gives Our Agents Hardware-Aware Model Intelligence
Every agent OS has the same dirty secret: model selection is a lie. You pick a model name, hardcode it in a config file, and pray it fits in VRAM. If the user has a different GPU — or no GPU — the agent either crashes, runs at 2 tokens per second, or silently falls back to something useless.
We had this problem across our entire stack. Static profile YAMLs mapping nvidia_high → deepseek-r1:14b, cpu_only → llama3.2:1b. Worked fine on our dev machines. Failed everywhere else. An agent bootstrapping on an RTX 3060 would get the same model as one on a 4090. A Mac Studio with 192GB unified memory got cpu_only because our detection code didn't understand Apple Silicon.
Then we found llmfit.
What llmfit Actually Does
llmfit is a Rust CLI/TUI tool by Alex Jones that solves the model selection problem properly. It:
- Probes your actual hardware — GPU model, VRAM, CPU cores, RAM, memory bandwidth. On NVIDIA it reads CUDA directly. On Apple Silicon it detects unified memory and Metal capabilities. On AMD it picks up ROCm.
- Scores 200+ models against that hardware across four dimensions: quality, speed, fit, and context length. Each model gets a composite score and a fit level (`perfect`, `good`, `marginal`, `too_tight`).
- Understands quantization — it doesn't just check whether a model's raw parameter count fits in VRAM. It evaluates every quantization level (Q4_K_M, Q5_K_S, fp16, etc.) and picks the best one that actually fits, maximizing quality within your hardware constraints.
- Handles MoE correctly — Mixture-of-Experts models like Mixtral don't load all parameters simultaneously. llmfit knows the difference between total parameters and active parameters, so it doesn't reject a 47B MoE model that only activates 13B at inference time.
- Serves everything via REST API — `llmfit serve` starts a lightweight HTTP server that exposes `/api/v1/system`, `/api/v1/models`, `/api/v1/models/top`, and per-model endpoints. Perfect for sidecar deployment.
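Those endpoints need nothing beyond the standard library to query. A minimal sketch, assuming the sidecar is reachable on port 8793 and that `/api/v1/models/top` accepts `use_case`, `min_fit`, and `limit` as query parameters (the parameter names mirror the MCP tool arguments shown later, but the exact query-string contract is an assumption):

```python
import json
import urllib.parse
import urllib.request

def top_models_url(base="http://localhost:8793", **params):
    """Build the /api/v1/models/top URL; query parameter names are assumed
    to mirror the MCP tool's arguments (use_case, min_fit, limit)."""
    query = urllib.parse.urlencode(params)
    return f"{base}/api/v1/models/top" + (f"?{query}" if query else "")

def top_models(**params):
    """Fetch the top-scored models from a running llmfit sidecar."""
    with urllib.request.urlopen(top_models_url(**params), timeout=2) as resp:
        return json.loads(resp.read())
```

The two-second timeout matters: if the sidecar is down, you want to fail fast and fall back, not hang an agent's bootstrap.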
The key insight is that model selection is a constraint satisfaction problem, not a lookup table. Given your VRAM, bandwidth, and use case, there's an optimal frontier of models. llmfit computes that frontier in milliseconds.
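To make the constraint-satisfaction framing concrete, here is a toy version of the quant-aware fit check. The bytes-per-weight figures are rough GGUF-style approximations and the 1.2x overhead margin is invented for the sketch; llmfit's real scoring is far more detailed:

```python
# Toy illustration of quant-aware fitting, not llmfit's actual code.
# Quants are ordered best quality first; sizes are rough bytes-per-weight.
QUANTS = [("fp16", 2.0), ("Q8_0", 1.06), ("Q5_K_S", 0.69), ("Q4_K_M", 0.56)]

def best_quant(total_params_b, vram_gb, active_params_b=None):
    """Return the highest-quality quant whose estimated size fits in VRAM.

    For MoE models, size against active parameters rather than total,
    mirroring the total-vs-active distinction described above.
    """
    sized = active_params_b or total_params_b  # MoE: count active weights
    for name, bytes_per_weight in QUANTS:
        est_gb = sized * bytes_per_weight * 1.2  # weights + invented margin
        if est_gb <= vram_gb:
            return name, round(est_gb, 1)
    return None  # nothing fits: "too_tight"
```

Even this toy version shows why a lookup table fails: the answer depends jointly on parameter count, quantization, architecture, and VRAM, not on a profile name.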
The Integration Architecture
We integrated llmfit at three layers:
┌─────────────────────────────────────────────────────────┐
│ AitherVeil (Dashboard) │
│ └─ /system → shows llmfit hardware + model scores │
├─────────────────────────────────────────────────────────┤
│ AitherNode (MCP Tools) 6 tools │
│ ├─ llmfit_system → hardware specs │
│ ├─ llmfit_recommend → top models for use_case │
│ ├─ llmfit_model_check → does model X fit? │
│ ├─ llmfit_auto_config → optimal multi-tier config │
│ ├─ llmfit_all_models → full scored catalog │
│ └─ llmfit_vram_estimate → VRAM for a specific model │
├─────────────────────────────────────────────────────────┤
│ aither-adk (Agent Development Kit) │
│ ├─ LLMRouter.model_for_effort() → llmfit-backed │
│ ├─ AgentSetup.detect_hardware() → llmfit validation │
│ ├─ AgentSetup.pull_models() → llmfit recommended │
│ └─ Config.get_llmfit_client() → wired into config │
├─────────────────────────────────────────────────────────┤
│ llmfit sidecar (Rust, port 8793) │
│ └─ REST API: /health, /api/v1/system, /api/v1/models │
└─────────────────────────────────────────────────────────┘
Layer 1: Docker Sidecar
llmfit runs as aither-llmfit in our Docker Compose stack. The Dockerfile downloads a pre-built binary from GitHub Releases — no Rust toolchain needed at build time. It probes hardware via /proc, sysfs, nvidia-smi, and pciutils. GPU passthrough is not required for detection — it reads PCIe device IDs through sysfs.
aither-llmfit:
  build:
    context: .
    dockerfile: docker/services/Dockerfile.llmfit
  ports:
    - "8793:8793"
  environment:
    - OLLAMA_HOST=http://host.docker.internal:11434
  deploy:
    resources:
      limits:
        memory: 512M
512MB memory limit. No GPU reservation. Starts in under 2 seconds. The entire container image is ~30MB because it's a single static Rust binary on debian:trixie-slim.
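The passthrough-free detection works because PCIe class and vendor IDs are visible through sysfs regardless of drivers. A simplified sketch of that idea in Python (llmfit's actual probing is in Rust and far more thorough; the vendor IDs and the `0x03` display-controller class prefix are standard PCI values):

```python
from pathlib import Path

# Well-known PCI vendor IDs. Sketch of the idea only, not llmfit's code.
GPU_VENDORS = {"0x10de": "nvidia", "0x1002": "amd", "0x8086": "intel"}

def pci_gpu_vendors(root="/sys/bus/pci/devices"):
    """List GPU vendors visible via sysfs -- no GPU passthrough required."""
    vendors = []
    for dev in sorted(Path(root).iterdir()):
        cls_file, vendor_file = dev / "class", dev / "vendor"
        if not (cls_file.exists() and vendor_file.exists()):
            continue
        if cls_file.read_text().startswith("0x03"):  # display controller
            vid = vendor_file.read_text().strip()
            vendors.append(GPU_VENDORS.get(vid, vid))
    return vendors
```

This is why the container needs no `--gpus` flag: identifying the card is a filesystem read, while actually using it is a driver concern.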
Layer 2: AitherNode MCP Tools
We created six MCP-callable tools so any agent or desktop client can query llmfit:
@mcp.tool()
async def llmfit_recommend(use_case: str = "general", min_fit: str = "good",
                           limit: int = 5, sort: str = "score") -> str:
    """Get hardware-optimal model recommendations."""
    client = _get_client()
    models = await client.get_top_models(
        use_case=use_case, min_fit=min_fit, limit=limit, sort=sort
    )
    return json.dumps([...])
An agent asking "what model should I use for coding?" now gets a real answer based on the machine it's running on, not a hardcoded string.
The llmfit_auto_config tool is particularly powerful — it generates a complete multi-tier configuration mapping effort levels to hardware-optimal models:
{
  "hardware": {"gpu": "RTX 4090", "vram_gb": 24, "ram_gb": 64},
  "fast": {"model": "llama3.2:3b", "tps": 120},
  "balanced": {"model": "nemotron-orchestrator-8b", "tps": 45},
  "reasoning": {"model": "deepseek-r1:14b", "tps": 28},
  "coding": {"model": "qwen2.5-coder:14b", "tps": 32}
}
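A consumer of that config might map an effort level onto a tier and fall back across tiers if one is missing. A hypothetical sketch (the function name and the tier thresholds are our illustration, not part of llmfit or the ADK):

```python
def pick_model_for_effort(config, effort):
    """Map a 1-10 effort level onto the auto_config tiers shown above.
    Thresholds (<=3 fast, <=6 balanced, else reasoning) are illustrative."""
    tier = "fast" if effort <= 3 else "balanced" if effort <= 6 else "reasoning"
    # Fall back across tiers if the config omits one
    entry = config.get(tier) or config.get("balanced") or config.get("fast")
    return entry["model"]

cfg = {
    "fast": {"model": "llama3.2:3b", "tps": 120},
    "balanced": {"model": "nemotron-orchestrator-8b", "tps": 45},
    "reasoning": {"model": "deepseek-r1:14b", "tps": 28},
}
```

The point is that the effort-to-tier policy lives in the agent, while the tier-to-model mapping comes from the hardware.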
Layer 3: ADK Agent Bootstrap
This is where it gets interesting. The ADK (Agent Development Kit) is our open-source framework for building agents. When an agent bootstraps with auto_setup(), the flow now looks like:
1. `detect_hardware()` runs native detection (nvidia-smi, /proc/meminfo, etc.) and cross-validates with llmfit. If llmfit reports different VRAM than nvidia-smi (which happens with driver version mismatches), it trusts llmfit's value — llmfit reads the GPU's BAR1 memory size directly.
2. `pull_models()` queries llmfit for hardware-scored Ollama model names instead of using static lookup tables. A machine with 8GB VRAM gets different recommendations than one with 24GB — not because of a profile name, but because llmfit computed what actually fits.
3. `LLMRouter.model_for_effort()` maps effort levels (1-10) to model tiers, and each tier queries llmfit for the optimal model. Priority chain: explicit config → hardware profile → llmfit → static defaults.
# Before: static lookup
_EFFORT_MODELS = {
    "fast": "llama3.2:3b",
    "balanced": "nemotron-orchestrator-8b",
    ...
}

# After: hardware-scored
config = await get_llmfit().recommend_config()
if config.get("balanced"):
    return config["balanced"]["model"]  # Best model that fits YOUR hardware
The fallback chain is important. If llmfit isn't running (standalone ADK without Docker), everything degrades gracefully to the static profile system. Zero hard dependencies.
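That chain can be sketched as follows. The client interface and the static defaults are illustrative assumptions, and the hardware-profile step from the real chain is elided for brevity:

```python
# Illustrative sketch of the degradation chain, not the actual ADK code.
STATIC_DEFAULTS = {"balanced": "nemotron-orchestrator-8b", "fast": "llama3.2:3b"}

async def resolve_model(tier, explicit=None, llmfit_client=None):
    """Explicit config -> llmfit -> static defaults.
    (The hardware-profile step in the real chain is elided here.)"""
    if explicit:
        return explicit                       # 1. explicit config always wins
    if llmfit_client is not None:
        try:
            config = await llmfit_client.recommend_config()
            if config.get(tier):
                return config[tier]["model"]  # 2. hardware-scored pick
        except Exception:
            pass                              # sidecar down: fall through
    return STATIC_DEFAULTS[tier]              # 3. static fallback
```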
The Bug That Started It All
The original LLMFitClient.py had a port bug that perfectly illustrates why integration details matter. The Docker container was configured to serve on port 8793 (AitherOS convention), and services.yaml mapped it correctly, but the Python client had 8787 (upstream default) hardcoded in its Docker-mode URL resolution:
# Before (broken)
if os.environ.get("AITHER_DOCKER_MODE"):
    return "http://aither-llmfit:8787"  # Wrong! Container serves on 8793

# After (fixed)
if os.environ.get("AITHER_DOCKER_MODE"):
    return "http://aither-llmfit:8793"
This meant every inter-container llmfit call was silently failing, and the entire system was falling back to static heuristics. Nobody noticed because the fallback was designed to be seamless. A testament to good degradation design — and a warning that graceful fallbacks can hide real bugs indefinitely.
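One mitigation worth pairing with any seamless fallback (our suggested pattern, not existing AitherOS code): make the degraded path loud, so a silent downgrade shows up in logs or metrics even when nothing visibly breaks.

```python
import logging

log = logging.getLogger("model-select")

def with_fallback(primary, fallback, source="llmfit"):
    """Try primary(); on any failure, log a WARNING and use fallback().

    A silent fallback hid the port bug above for weeks; a warning (or a
    counter scraped by the dashboard) makes the degraded path visible.
    """
    try:
        return primary(), source
    except Exception as exc:
        log.warning("%s unavailable (%s); falling back to static defaults",
                    source, exc)
        return fallback(), "static"
```

Returning the source alongside the result also lets callers surface "running on static defaults" in status endpoints.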
What Changes for Users
If you're running AitherOS with the core or gpu Docker profile, llmfit is already active. Your agents are already getting hardware-scored model recommendations instead of static profiles.
If you're using the ADK standalone:
from adk.setup import auto_setup
report = await auto_setup()
# report.models_available now contains llmfit-recommended models
# (if llmfit sidecar is running), or profile-based defaults
You can also run llmfit directly:
# Install (Rust)
cargo install llmfit
# Or download pre-built binary
curl -fsSL https://github.com/Aitherium/llmfit/releases/latest/download/llmfit-linux-x86_64.tar.gz | tar xz
# Interactive TUI
llmfit
# REST API
llmfit serve --port 8793
# Point the ADK at it
export AITHER_LLMFIT_URL=http://localhost:8793
Credit Where It's Due
llmfit is built by Alex Jones and is open source under the MIT license. The repository is at github.com/Aitherium/llmfit.
What impressed us most about llmfit is that it solves the right problem at the right abstraction level. It doesn't try to be a model registry or an inference server. It answers one question — "which models can this machine run, and how well?" — and answers it correctly, accounting for quantization, MoE architectures, memory bandwidth, and multi-GPU setups. That's exactly the primitive we needed.
The Rust implementation means it starts instantly, uses negligible memory, and the binary is fully self-contained. No Python runtime, no pip dependencies, no Docker-in-Docker. Just a 10MB static binary that reads your hardware and tells you what fits.
We contributed our integration patterns back and hope other agent frameworks adopt the same approach. Static model lists in config files are a solved problem now. Use llmfit.
What's Next
We're exploring deeper integration:
- MicroScheduler awareness — the LLM scheduler (port 8150) that manages VRAM allocation across concurrent requests could use llmfit's per-model memory estimates to make better scheduling decisions.
- Dynamic model swapping — if VRAM pressure increases (more agents requesting inference), automatically downgrade to a smaller model that llmfit scores as the next-best fit.
- Training pipeline integration — Prism (our training orchestrator) could query llmfit to determine which base models can be fine-tuned on the current hardware, including LoRA VRAM overhead.
- Fleet-wide scoring — in multi-node deployments, aggregate llmfit data from all nodes to build a cluster-wide model placement strategy.
The fundamental shift is from "configure models per machine" to "let the machine tell you what it can run." llmfit makes that possible.