Four Layers of Defense: How We Encrypted and Authenticated Every Byte Between 203 Microservices
AitherOS runs 203 microservices on a single host. They communicate over HTTP on localhost. That's fast, simple, and completely insecure.
Any process on the host can sniff inter-service traffic. A compromised service can impersonate another. A rogue container can replay requests. And when your services include an LLM orchestrator that spawns AI agents with tool access, "insecure internal network" isn't a risk you defer to next quarter.
We shipped four independent security layers in one sprint: TLS transport encryption, Ed25519 request signing, mutual TLS authentication, and a NanoGPT-powered threat detection system. Each layer solves a different class of attack. Together, they make AitherOS's internal network as hardened as its external surface.
The Threat Model
Before writing code, we mapped the attack surface:
| Attack | Pre-mitigation | Which layer blocks it |
|---|---|---|
| Packet sniffing on host | Wide open | TLS |
| Service impersonation | Wide open | Ed25519 signing |
| Stolen credential replay | Time-limited | Nonce + timestamp |
| Unauthorized service-to-service calls | Wide open | mTLS |
| Compromised agent probing auth boundaries | Undetected | Guardian |
| Privilege escalation via tool chaining | Undetected | Guardian + Sentry |
No single layer handles everything. That's the point.
Layer 1: TLS Transport Encryption
Problem: All inter-service HTTP traffic was plaintext on localhost.
Solution: Every service now starts with TLS. get_service_url() returns https:// when TLS is enabled. Uvicorn binds with SSL context. Clients verify the CA chain.
Certificate Architecture
We run a two-tier PKI:
AitherNet Root CA (EC P-384, 10-year validity)
└── AitherNet Intermediate CA (EC P-384, 5-year)
    ├── aither-genesis.crt (EC P-256, 1-year)
    ├── aither-node.crt
    ├── aither-secrets.crt
    └── ... (22 services pre-provisioned)
The root CA never signs service certs directly. The intermediate handles issuance and can be rotated without rebuilding the entire chain. The CA chain file (ca-chain.pem) is auto-assembled: intermediate + root, so clients can verify the full chain.
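Chain-file assembly is plain concatenation in leaf-to-root order. A minimal sketch (the helper name is illustrative, not AitherOS's actual code):

```python
from pathlib import Path

def assemble_ca_chain(intermediate_pem: str, root_pem: str, out_path: Path) -> Path:
    # Intermediate first, then root: verifiers walk from the service
    # cert up through the intermediate to the trust anchor.
    out_path.write_text(intermediate_pem.rstrip() + "\n" + root_pem.rstrip() + "\n")
    return out_path
```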
Cert Provisioning (3-Tier Resolution)
When a service boots, TLSConfig.provision_service_cert() tries three sources in order:
- Tier 1: Already on disk — Library/Data/tls/{service}/cert.pem + key.pem
- Tier 2: Copy from the AitherSecrets CA directory — Library/Data/secrets/ca/issued/{service}/
- Tier 3: Request from the AitherSecrets API — POST /ca/issue/{service_name}
Most services hit tier 1 (instant). New services hit tier 3 once, then tier 1 forever.
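The three-tier fallback can be sketched as follows (a minimal illustration, not the real provision_service_cert; directory layout and the tier-3 callback are assumptions based on the paths above):

```python
from pathlib import Path
from typing import Callable, Tuple

def provision_cert(
    service: str,
    tls_dir: Path,
    ca_issued_dir: Path,
    request_from_secrets: Callable[[str], Tuple[bytes, bytes]],
) -> Tuple[Path, Path]:
    """Resolve a service cert from the fastest available source."""
    cert = tls_dir / service / "cert.pem"
    key = tls_dir / service / "key.pem"
    if cert.exists() and key.exists():              # tier 1: local cache
        return cert, key

    issued = ca_issued_dir / service
    if (issued / "cert.pem").exists():              # tier 2: copy from CA dir
        cert.parent.mkdir(parents=True, exist_ok=True)
        cert.write_bytes((issued / "cert.pem").read_bytes())
        key.write_bytes((issued / "key.pem").read_bytes())
        return cert, key

    cert_pem, key_pem = request_from_secrets(service)  # tier 3: API issuance
    cert.parent.mkdir(parents=True, exist_ok=True)
    cert.write_bytes(cert_pem)
    key.write_bytes(key_pem)
    return cert, key
```

On first boot a new service falls through to tier 3; every boot after that is a tier-1 disk read.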
The Config Toggle
# enterprise.yaml
interservice_tls:
  enabled: true # default
  cert_dir: "Library/Data/tls"
  ca_chain: "Library/Data/tls/ca-chain.pem"
Env override: AITHER_INTERSERVICE_TLS=false disables it. This exists for local development and testing, not production. The toggle is checked in AitherPorts.get_service_url(), AitherIntegration.__init__() (uvicorn SSL), and BaseServiceClient.__init__() (CA verification).
Layer 2: Ed25519 Request Signing
TLS encrypts the channel, but it doesn't prove who sent the request. Any service with network access can reach any other service's HTTPS endpoint. We need per-request authentication.
Solution: Every outbound HTTP request is signed with the sending service's Ed25519 private key. Every inbound request is verified against the sender's public key.
The Signing Scheme
Payload = METHOD | path | timestamp | nonce | SHA256(body)
Signature = Ed25519.sign(payload, private_key)
Five headers carry the signature:
X-Aither-Service: AitherGenesis
X-Aither-Timestamp: 1741852800
X-Aither-Nonce: a7f3b2c4d5e6
X-Aither-Key-Id: genesis-2026-03
X-Aither-Signature: base64(ed25519_signature)
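As a sketch, sign-and-verify for this scheme might look like the following, using the pyca/cryptography library (function names are ours; only the payload layout and header names come from the post):

```python
import base64
import hashlib
import os
import time

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

def _payload(method: str, path: str, timestamp: str, nonce: str, body: bytes) -> bytes:
    # Canonical payload: METHOD | path | timestamp | nonce | SHA256(body)
    digest = hashlib.sha256(body).hexdigest()
    return "|".join([method.upper(), path, timestamp, nonce, digest]).encode()

def sign_request(key: Ed25519PrivateKey, service: str, key_id: str,
                 method: str, path: str, body: bytes) -> dict:
    timestamp = str(int(time.time()))
    nonce = os.urandom(6).hex()
    sig = key.sign(_payload(method, path, timestamp, nonce, body))
    return {
        "X-Aither-Service": service,
        "X-Aither-Timestamp": timestamp,
        "X-Aither-Nonce": nonce,
        "X-Aither-Key-Id": key_id,
        "X-Aither-Signature": base64.b64encode(sig).decode(),
    }

def verify_request(pub: Ed25519PublicKey, headers: dict,
                   method: str, path: str, body: bytes) -> bool:
    try:
        pub.verify(
            base64.b64decode(headers["X-Aither-Signature"]),
            _payload(method, path, headers["X-Aither-Timestamp"],
                     headers["X-Aither-Nonce"], body),
        )
        return True
    except InvalidSignature:
        return False
```

Because the body hash is inside the signed payload, any tampering with the body invalidates the signature even though the headers are unchanged.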
Replay Protection
Two mechanisms prevent replay attacks:
- Timestamp tolerance: Requests older than 5 minutes are rejected. This handles clock drift while blocking stored-and-replayed requests.
- Nonce tracking: A bounded LRU cache (50,000 entries) tracks recently seen nonces. Duplicate nonces within the timestamp window are rejected.
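Both checks fit in a few lines. A sketch with illustrative names (the 300-second tolerance and 50,000-entry bound mirror the defaults above):

```python
import time
from collections import OrderedDict
from typing import Optional

class ReplayGuard:
    """Timestamp window plus bounded nonce cache."""

    def __init__(self, tolerance_s: int = 300, cache_size: int = 50_000):
        self.tolerance_s = tolerance_s
        self.cache_size = cache_size
        self._seen: "OrderedDict[str, float]" = OrderedDict()

    def check(self, timestamp: str, nonce: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        if abs(now - int(timestamp)) > self.tolerance_s:
            return False                    # stored-and-replayed, or severe clock skew
        if nonce in self._seen:
            return False                    # duplicate nonce inside the window
        self._seen[nonce] = now
        if len(self._seen) > self.cache_size:
            self._seen.popitem(last=False)  # evict the oldest entry (bounded LRU)
        return True
```

Evicted nonces are only a risk if they could be replayed, and by then the timestamp check rejects them anyway, which is why a bounded cache is safe.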
Enforcement Modes
# enterprise.yaml
signing:
  mode: enforce # disabled | warn | enforce
- disabled: No signing, no verification. For development.
- warn: Sign outbound, verify inbound, log failures but accept the request. For rollout.
- enforce: Reject unsigned or invalid requests with 401.
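The three modes reduce to a small decision table. A sketch of the inbound side (names illustrative; the real middleware also has to emit the 401):

```python
from enum import Enum

class Mode(Enum):
    DISABLED = "disabled"
    WARN = "warn"
    ENFORCE = "enforce"

def decide(mode: Mode, signature_valid: bool) -> tuple:
    """Return (accept, log_warning) for an inbound request."""
    if mode is Mode.DISABLED:
        return True, False     # no verification at all
    if signature_valid:
        return True, False
    if mode is Mode.WARN:
        return True, True      # accept, but log the failure for rollout triage
    return False, False        # enforce: caller rejects with 401
```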
We ran in warn mode for 48 hours to catch any service that wasn't signing correctly, then flipped to enforce.
Key Provisioning
Services get Ed25519 keypairs from AitherSecrets at boot -- the infrastructure already existed. Keys are cached locally, Fernet-encrypted with the machine key. Auto-rotation is configurable (default 90 days) with a 24-hour overlap period where both old and new keys are valid.
Exempt Paths
Health checks, metrics, docs, and favicon are exempt. These are read-only, unauthenticated by design, and high-frequency. Signing them would add latency to the hot path with no security benefit.
Layer 3: Mutual TLS (mTLS)
TLS proves the server is who it claims to be. Signing proves the request is authentic. But mTLS goes further: the server verifies the client's certificate at the TLS handshake level, before any HTTP processing occurs.
How It Works
- Service A opens a TLS connection to Service B
- Service B's SSL context requires client certificates (ssl.CERT_REQUIRED)
- Service A presents its cert (the same cert it uses for its own TLS server)
- Service B verifies the cert against the CA chain
- MTLSEnforcementMiddleware extracts the Common Name (CN) from the client cert
- The CN becomes the authenticated service identity for RBAC checks
The Middleware
class MTLSEnforcementMiddleware:
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http":
            # Extract the client cert CN from the TLS transport
            transport = scope.get("transport")
            if transport is not None:
                peercert = transport.get_extra_info("peercert")
                scope["authenticated_service"] = extract_cn(peercert)
        await self.app(scope, receive, send)
This is applied automatically by AitherIntegration -- services don't need to opt in.
Why Both Signing and mTLS?
They protect against different attacks:
| Scenario | Ed25519 Signing | mTLS |
|---|---|---|
| Process on host without valid cert tries to call a service | Rejected (no signing key) | Rejected (no client cert) |
| MITM intercepts and modifies request body | Rejected (signature invalid) | Not detected (TLS payload is decrypted at endpoint) |
| Service A's cert is compromised | Still safe (attacker lacks signing key) | Compromised |
| Signing key is compromised | Compromised | Still safe (attacker lacks cert) |
Neither is redundant. The signing key is independent from the TLS cert. Compromising one doesn't compromise the other.
Layer 4: Unified Agent Threat Detection (Guardian)
The first three layers handle infrastructure-level attacks. Layer 4 handles something scarier: AI agents that probe security boundaries.
This isn't theoretical. Research on emergent AI agent cyber behavior shows that LLM-powered agents, given enough autonomy, will naturally discover and probe authentication endpoints, attempt privilege escalation through tool chaining, and try to exfiltrate data through steganographic channels. Not because they're "trying to escape" -- because these behaviors are statistically similar to the task-completion patterns they were trained on.
The Problem: Four Security Services, No Correlation
AitherOS already had four security services:
- Sentry — Rule-based threat detection (badge-based heuristics)
- Sentinel — Anomaly detection via statistical baselines
- Jail — Agent containment (isolate compromised agents)
- Chaos — Adversarial testing (red-team your own agents)
Each service operated independently. Sentry could detect an auth probe. Sentinel could flag an anomalous tool usage pattern. But no system could see that the same agent was doing both within a 30-second window -- a clear escalation pattern.
Guardian: The Correlation Engine
Guardian is an in-process singleton (lib/security/AitherGuardian.py) that cross-correlates threat signals from all four security services. It runs inside Genesis (zero HTTP latency) and uses sliding time windows with pre-built correlation rules.
Agent "explorer-7" timeline:
t+0s: Sentry detects AUTH_PROBING (confidence 0.6)
t+12s: Sentry detects TOOL_ESCALATION (confidence 0.5)
t+18s: Sentinel flags anomalous tool usage pattern
Guardian correlation rule fires:
Rule: "auth_probe + tool_escalation within 60s from same agent"
Compound score: 0.6 * (1 + 0.3 * (2-1)) = 0.78
Action: ESCALATE → Jail.isolate("explorer-7", severity="high")
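The post gives only this one worked example of the compound score; under the assumption that the 0.3 boost applies once per additional correlated signal, it reduces to:

```python
def compound_score(confidences: list, boost: float = 0.3) -> float:
    """Boost the strongest signal by 30% per additional correlated signal.

    Reconstructed from the worked example:
    0.6 * (1 + 0.3 * (2 - 1)) = 0.78 for two correlated signals.
    """
    base = max(confidences)
    return min(1.0, base * (1 + boost * (len(confidences) - 1)))
```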
Five New Threat Categories
We added five behavior-based detection heuristics to Sentry's AgentBehaviorDetector:
| Category | What It Detects | How |
|---|---|---|
| AUTH_PROBING | Agent repeatedly hitting auth/identity endpoints | Sliding window count > threshold |
| PRIV_ESCALATION | Agent requesting capabilities beyond its token | Capability mismatch detection |
| SECURITY_TAMPERING | Agent modifying security configs or audit logs | Path pattern matching on mutations |
| STEGANOGRAPHIC_EXFIL | Agent hiding data in image metadata or code comments | Shannon entropy analysis (>4.5 bits/byte) |
| TOOL_ESCALATION | Agent chaining tools in escalating privilege order | Tool category progression tracking |
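The entropy heuristic in the STEGANOGRAPHIC_EXFIL row is easy to reproduce with the stdlib. A sketch (the 4.5 bits/byte threshold is from the table; the helper names are ours):

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; encrypted or compressed
    payloads approach the 8.0 maximum."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def looks_like_exfil(data: bytes, threshold: float = 4.5) -> bool:
    # Plain English prose usually sits near or below the threshold;
    # hidden encoded payloads push well above it.
    return shannon_entropy(data) > threshold
```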
NanoGPT-Powered Anomaly Scoring
The final piece: we trained a NanoGPT model on normal agent behavior sequences. When a new sequence is scored, high loss = suspicious behavior (the model "doesn't recognize" this pattern). This catches novel attack patterns that rule-based heuristics miss.
The training pipeline generates three types of curriculum data:
- Baseline normal — tool usage, memory access, service calls from production logs
- Known attack patterns — auth probing, escalation, exfiltration sequences
- Adversarial variants — Chaos-generated attack mutations
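The NanoGPT training code isn't shown here. To make the "high loss = unfamiliar" idea concrete, a deliberately tiny stand-in is a smoothed bigram model over action tokens; everything below is illustrative, not the production scorer:

```python
import math
from collections import defaultdict

class BigramAnomalyScorer:
    """Toy stand-in for the NanoGPT scorer: sequences the model has
    never seen get a high average negative log-likelihood ('loss')."""

    def __init__(self, smoothing: float = 0.5):
        self.smoothing = smoothing
        self.counts = defaultdict(lambda: defaultdict(float))
        self.vocab = set()

    def train(self, sequences):
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                self.counts[a][b] += 1.0
                self.vocab.update((a, b))

    def loss(self, seq) -> float:
        v = max(len(self.vocab), 1)
        total, n = 0.0, 0
        for a, b in zip(seq, seq[1:]):
            row = self.counts.get(a, {})
            # Additive smoothing so unseen transitions get small, nonzero mass
            p = (row.get(b, 0.0) + self.smoothing) / (sum(row.values()) + self.smoothing * v)
            total += -math.log(p)
            n += 1
        return total / max(n, 1)
```

Scoring a baseline-like sequence yields low loss; a never-seen attack sequence yields a markedly higher one, which is the signal thresholded for alerts.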
Jail Rewrite
The old Jail was a 24-line stub with a dictionary. The new Jail has:
- ContainmentCell — Per-agent isolation with configurable restrictions (no_network, no_llm, no_tools, file_readonly)
- Severity-based defaults — Critical severity auto-adds no_llm + file_readonly
- Expiration — Cells auto-release after a configurable duration
- 5 API endpoints — isolate, release, status, list, history
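A minimal sketch of what a ContainmentCell might look like (the severity-to-restriction mapping beyond the critical case is a hypothetical illustration):

```python
import time
from dataclasses import dataclass, field
from typing import Optional, Set

# Hypothetical mapping; the post only states that critical severity
# auto-adds no_llm + file_readonly on top of the base restrictions.
SEVERITY_DEFAULTS = {
    "high": {"no_network", "no_tools"},
    "critical": {"no_network", "no_tools", "no_llm", "file_readonly"},
}

@dataclass
class ContainmentCell:
    agent_id: str
    severity: str
    duration_s: float = 3600.0
    restrictions: Set[str] = field(default_factory=set)
    created_at: float = field(default_factory=time.time)

    def __post_init__(self):
        # Merge severity defaults into any explicitly requested restrictions
        self.restrictions |= SEVERITY_DEFAULTS.get(self.severity, set())

    def expired(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        return now - self.created_at >= self.duration_s
```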
The Feedback Loop
Agent acts → Sentry detects → Guardian correlates → Jail contains
↑
Chaos tests → finds gaps → Guardian auto-tunes thresholds
When Chaos finds a vulnerability (an attack that wasn't detected), it feeds the result to Guardian, which adjusts its correlation thresholds. The system gets better at detection without manual tuning.
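The post doesn't give the tuning algorithm; one plausible shape is an asymmetric update that reacts strongly to misses and drifts back slowly on successes (entirely an assumption):

```python
def autotune_threshold(current: float, detected: bool,
                       step: float = 0.05, floor: float = 0.2,
                       ceiling: float = 0.95) -> float:
    """Lower the correlation threshold when a Chaos attack slipped
    through undetected; raise it slowly otherwise to limit false
    positives. Step sizes and bounds are illustrative."""
    if not detected:
        return max(floor, current - step)       # missed attack: be more sensitive
    return min(ceiling, current + step / 5)     # detected: relax slightly
```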
How They Stack
Each layer operates independently. Any one of them can be disabled without breaking the others. Together, they provide comprehensive coverage:
┌─────────────────────┐
│ Layer 4: Guardian │ AI behavior correlation
│ (application) │ + anomaly detection
├─────────────────────┤
│ Layer 3: mTLS │ Mutual cert verification
│ (transport auth) │ at TLS handshake
├─────────────────────┤
│ Layer 2: Ed25519 │ Per-request signatures
│ (request auth) │ + replay protection
├─────────────────────┤
│ Layer 1: TLS │ Channel encryption
│ (confidentiality) │ + server identity
└─────────────────────┘
An attacker would need to:
- Obtain a valid TLS client certificate (mTLS)
- Obtain the target service's Ed25519 private key (signing)
- Craft requests within the 5-minute timestamp window with unique nonces (replay protection)
- Avoid triggering any of Guardian's 6 correlation rules, Sentry's 10 heuristics, or NanoGPT's anomaly baseline (threat detection)
All while the traffic is encrypted end-to-end (TLS).
Testing: 215 Tests Across All Four Layers
| Layer | Test File | Tests | Key Coverage |
|---|---|---|---|
| TLS | test_interservice_tls.py | 47 | Cert provisioning, SSL contexts, URL scheme switching, CA chain assembly, config toggles |
| Ed25519 | test_service_signer.py | 69 | Signing, verification, replay, nonces, key rotation, enrollment, middleware, end-to-end |
| Guardian/UATDS | test_agent_threat_detection.py | 99 | All 5 threat categories, correlation rules, compound scoring, Jail integration, NanoGPT curriculum |
Zero mocking of cryptographic operations. Every test signs real payloads, verifies real signatures, and issues real certificates. The tests are slower than mocked versions, but they test the actual security boundary.
Performance Impact
The concern with adding four security layers is latency. Here's what we measured:
- TLS handshake: ~2ms (connection pooling amortizes this to near-zero for persistent connections)
- Ed25519 signing: ~0.1ms per request (Ed25519 is fast)
- Ed25519 verification: ~0.2ms per request
- mTLS client cert verification: Happens during TLS handshake, no additional cost
- Guardian correlation: In-process, sub-millisecond (no HTTP call)
Total overhead per inter-service request: <3ms on cold connection, <1ms on warm connection. For services that make 100+ internal calls per user request (like the chat pipeline), the aggregate overhead is under 50ms -- invisible compared to the 2-30 second LLM inference time.
Configuration: One File, Four Toggles
# config/enterprise.yaml
interservice_tls:
  enabled: true
  cert_dir: "Library/Data/tls"
  ca_chain: "Library/Data/tls/ca-chain.pem"

signing:
  mode: enforce # disabled | warn | enforce
  timestamp_tolerance: 300 # seconds
  nonce_cache_size: 50000
  key_rotation_days: 90

mtls:
  enabled: true
  require_client_cert: true

guardian:
  enabled: true
  correlation_window: 60 # seconds
  auto_contain: true # auto-isolate on high-confidence threats
Environment variables override everything: AITHER_INTERSERVICE_TLS, AITHER_SIGNING_MODE, AITHER_MTLS_ENABLED, AITHER_GUARDIAN_ENABLED.
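The precedence rule is simple: an explicit AITHER_* value wins, otherwise the YAML value stands. A sketch (the truthy-string parsing is our assumption; the post only states the precedence):

```python
import os

def resolve_flag(env_var: str, yaml_value: bool, environ=None) -> bool:
    """Env var overrides enterprise.yaml when set; YAML is the fallback."""
    environ = os.environ if environ is None else environ
    raw = environ.get(env_var)
    if raw is None:
        return yaml_value
    return raw.strip().lower() in {"1", "true", "yes", "on"}
```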
What We Learned
1. Ship in warn mode first. We ran Ed25519 signing in warn mode for 48 hours before enforcing. This caught three services that were making unsigned requests due to import timing issues. If we'd gone straight to enforce, those services would have been bricked.
2. Health checks must be exempt. Our first attempt applied signing to all requests including /health. The Kubernetes-style health check loop doesn't have signing keys. Health checks piled up as 401s, the service reported unhealthy, and the boot orchestrator tried to restart it. Exempt paths are not optional.
3. In-process beats microservice for security. Guardian runs as a singleton inside Genesis, not as a separate service. This means threat correlation happens with zero network latency and can access the full process state. The tradeoff is that Guardian goes down if Genesis goes down -- but if Genesis is down, there are no agents to monitor anyway.
4. Two independent keys > one shared key. Ed25519 signing keys and TLS certificates are provisioned independently, stored in different directories, and rotated on different schedules. This means compromising one doesn't compromise the other. The extra complexity is worth it.
215 tests. 4 security layers. 203 services protected. Zero performance regressions on the existing 2690+ test suite. Defense in depth isn't a buzzword -- it's an engineering discipline.