Four Layers of Defense: How We Encrypted and Authenticated Every Byte Between 203 Microservices
AitherOS runs 203 microservices on a single host. They communicate over HTTP on localhost. That's fast, simple, and completely insecure.
Any process on the host can sniff inter-service traffic. A compromised service can impersonate another. A rogue container can replay requests. And when your services include an LLM orchestrator that spawns AI agents with tool access, "insecure internal network" isn't a risk you defer to next quarter.
We shipped four independent security layers in one sprint: TLS transport encryption, Ed25519 request signing, mutual TLS authentication, and a NanoGPT-powered threat detection system. Each layer solves a different class of attack. Together, they make AitherOS's internal network as hardened as its external surface.
The Threat Model
Before writing code, we mapped the attack surface:
| Attack | Pre-mitigation | Which layer blocks it |
|---|---|---|
| Packet sniffing on host | Wide open | TLS |
| Service impersonation | Wide open | Ed25519 signing |
| Stolen credential replay | Time-limited | Nonce + timestamp |
| Unauthorized service-to-service calls | Wide open | mTLS |
| Compromised agent probing auth boundaries | Undetected | Guardian |
| Privilege escalation via tool chaining | Undetected | Guardian + Sentry |
No single layer handles everything. That's the point.
Layer 1: TLS Transport Encryption
Problem: All inter-service HTTP traffic was plaintext on localhost.
Solution: Every service now starts with TLS. get_service_url() returns https:// when TLS is enabled. Uvicorn binds with SSL context. Clients verify the CA chain.
Certificate Architecture
We run a two-tier PKI:
AitherNet Root CA (EC P-384, 10-year validity)
└── AitherNet Intermediate CA (EC P-384, 5-year)
    ├── aither-genesis.crt (EC P-256, 1-year)
    ├── aither-node.crt
    ├── aither-secrets.crt
    └── ... (22 services pre-provisioned)
The root CA never signs service certs directly. The intermediate handles issuance and can be rotated without rebuilding the entire chain. The CA chain file (ca-chain.pem) is auto-assembled: intermediate + root, so clients can verify the full chain.
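Chain-file assembly is plain concatenation in leaf-to-root order. A minimal sketch (the helper name is illustrative, not AitherOS's actual code):

```python
from pathlib import Path

def assemble_ca_chain(intermediate_pem: str, root_pem: str, out_path: Path) -> Path:
    # Intermediate first, then root: verifiers walk from the service
    # cert up through the intermediate to the trust anchor.
    out_path.write_text(intermediate_pem.rstrip() + "\n" + root_pem.rstrip() + "\n")
    return out_path
```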
Cert Provisioning (3-Tier Resolution)
When a service boots, TLSConfig.provision_service_cert() tries three sources in order:
- Tier 1: Already on disk — Library/Data/tls/{service}/cert.pem + key.pem
- Tier 2: Copy from the AitherSecrets CA directory — Library/Data/secrets/ca/issued/{service}/
- Tier 3: Request from the AitherSecrets API — POST /ca/issue/{service_name}
Most services hit tier 1 (instant). New services hit tier 3 once, then tier 1 forever.
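The three-tier fallback can be sketched as follows (a minimal illustration, not the real provision_service_cert; directory layout and the tier-3 callback are assumptions based on the paths above):

```python
from pathlib import Path
from typing import Callable, Tuple

def provision_cert(
    service: str,
    tls_dir: Path,
    ca_issued_dir: Path,
    request_from_secrets: Callable[[str], Tuple[bytes, bytes]],
) -> Tuple[Path, Path]:
    """Resolve a service cert from the fastest available source."""
    cert = tls_dir / service / "cert.pem"
    key = tls_dir / service / "key.pem"
    if cert.exists() and key.exists():              # tier 1: local cache
        return cert, key

    issued = ca_issued_dir / service
    if (issued / "cert.pem").exists():              # tier 2: copy from CA dir
        cert.parent.mkdir(parents=True, exist_ok=True)
        cert.write_bytes((issued / "cert.pem").read_bytes())
        key.write_bytes((issued / "key.pem").read_bytes())
        return cert, key

    cert_pem, key_pem = request_from_secrets(service)  # tier 3: API issuance
    cert.parent.mkdir(parents=True, exist_ok=True)
    cert.write_bytes(cert_pem)
    key.write_bytes(key_pem)
    return cert, key
```

On first boot a new service falls through to tier 3; every boot after that is a tier-1 disk read.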
The Config Toggle
# enterprise.yaml
interservice_tls:
  enabled: true # default
  cert_dir: "Library/Data/tls"
  ca_chain: "Library/Data/tls/ca-chain.pem"
Env override: AITHER_INTERSERVICE_TLS=false disables it. This exists for local development and testing, not production. The toggle is checked in AitherPorts.get_service_url(), AitherIntegration.__init__() (uvicorn SSL), and BaseServiceClient.__init__() (CA verification).
Layer 2: Ed25519 Request Signing
TLS encrypts the channel, but it doesn't prove who sent the request. Any service with network access can reach any other service's HTTPS endpoint. We need per-request authentication.
Solution: Every outbound HTTP request is signed with the sending service's Ed25519 private key. Every inbound request is verified against the sender's public key.
The Signing Scheme
Payload = METHOD | path | timestamp | nonce | SHA256(body)
Signature = Ed25519.sign(payload, private_key)
Five headers carry the signature:
X-Aither-Service: AitherGenesis
X-Aither-Timestamp: 1741852800
X-Aither-Nonce: a7f3b2c4d5e6
X-Aither-Key-Id: genesis-2026-03
X-Aither-Signature: base64(ed25519_signature)
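As a sketch, sign-and-verify for this scheme might look like the following, using the pyca/cryptography library (function names are ours; only the payload layout and header names come from the post):

```python
import base64
import hashlib
import os
import time

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)

def _payload(method: str, path: str, timestamp: str, nonce: str, body: bytes) -> bytes:
    # Canonical payload: METHOD | path | timestamp | nonce | SHA256(body)
    digest = hashlib.sha256(body).hexdigest()
    return "|".join([method.upper(), path, timestamp, nonce, digest]).encode()

def sign_request(key: Ed25519PrivateKey, service: str, key_id: str,
                 method: str, path: str, body: bytes) -> dict:
    timestamp = str(int(time.time()))
    nonce = os.urandom(6).hex()
    sig = key.sign(_payload(method, path, timestamp, nonce, body))
    return {
        "X-Aither-Service": service,
        "X-Aither-Timestamp": timestamp,
        "X-Aither-Nonce": nonce,
        "X-Aither-Key-Id": key_id,
        "X-Aither-Signature": base64.b64encode(sig).decode(),
    }

def verify_request(pub: Ed25519PublicKey, headers: dict,
                   method: str, path: str, body: bytes) -> bool:
    try:
        pub.verify(
            base64.b64decode(headers["X-Aither-Signature"]),
            _payload(method, path, headers["X-Aither-Timestamp"],
                     headers["X-Aither-Nonce"], body),
        )
        return True
    except InvalidSignature:
        return False
```

Because the body hash is inside the signed payload, any tampering with the body invalidates the signature even though the headers are unchanged.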
Replay Protection
Two mechanisms prevent replay attacks:
- Timestamp tolerance: Requests older than 5 minutes are rejected. This handles clock drift while blocking stored-and-replayed requests.
- Nonce tracking: A bounded LRU cache (50,000 entries) tracks recently seen nonces. Duplicate nonces within the timestamp window are rejected.
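Both checks fit in a few lines. A sketch with illustrative names (the 300-second tolerance and 50,000-entry bound mirror the defaults above):

```python
import time
from collections import OrderedDict
from typing import Optional

class ReplayGuard:
    """Timestamp window plus bounded nonce cache."""

    def __init__(self, tolerance_s: int = 300, cache_size: int = 50_000):
        self.tolerance_s = tolerance_s
        self.cache_size = cache_size
        self._seen: "OrderedDict[str, float]" = OrderedDict()

    def check(self, timestamp: str, nonce: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        if abs(now - int(timestamp)) > self.tolerance_s:
            return False                    # stored-and-replayed, or severe clock skew
        if nonce in self._seen:
            return False                    # duplicate nonce inside the window
        self._seen[nonce] = now
        if len(self._seen) > self.cache_size:
            self._seen.popitem(last=False)  # evict the oldest entry (bounded LRU)
        return True
```

Evicted nonces are only a risk if they could be replayed, and by then the timestamp check rejects them anyway, which is why a bounded cache is safe.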
Enforcement Modes
# enterprise.yaml
signing:
  mode: enforce # disabled | warn | enforce
- disabled: No signing, no verification. For development.
- warn: Sign outbound, verify inbound, log failures but accept the request. For rollout.
- enforce: Reject unsigned or invalid requests with 401.
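The three modes reduce to a small decision table. A sketch of the inbound side (names illustrative; the real middleware also has to emit the 401):

```python
from enum import Enum

class Mode(Enum):
    DISABLED = "disabled"
    WARN = "warn"
    ENFORCE = "enforce"

def decide(mode: Mode, signature_valid: bool) -> tuple:
    """Return (accept, log_warning) for an inbound request."""
    if mode is Mode.DISABLED:
        return True, False     # no verification at all
    if signature_valid:
        return True, False
    if mode is Mode.WARN:
        return True, True      # accept, but log the failure for rollout triage
    return False, False        # enforce: caller rejects with 401
```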
We ran in warn mode for 48 hours to catch any service that wasn't signing correctly, then flipped to enforce.
Key Provisioning
Services get Ed25519 keypairs from AitherSecrets at boot -- the infrastructure already existed. Keys are cached locally, Fernet-encrypted with the machine key. Auto-rotation is configurable (default 90 days) with a 24-hour overlap period where both old and new keys are valid.
Exempt Paths
Health checks, metrics, docs, and favicon are exempt. These are read-only, unauthenticated by design, and high-frequency. Signing them would add latency to the hot path with no security benefit.
Layer 3: Mutual TLS (mTLS)
TLS proves the server is who it claims to be. Signing proves the request is authentic. But mTLS goes further: the server verifies the client's certificate at the TLS handshake level, before any HTTP processing occurs.
How It Works
- Service A opens a TLS connection to Service B
- Service B's SSL context requires client certificates (ssl.CERT_REQUIRED)
- Service A presents its cert (the same cert it uses for its own TLS server)
- Service B verifies the cert against the CA chain
- MTLSEnforcementMiddleware extracts the Common Name (CN) from the client cert
- The CN becomes the authenticated service identity for RBAC checks
The Middleware
class MTLSEnforcementMiddleware:
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http":
            # Extract the client cert CN from the TLS transport
            transport = scope.get("transport")
            if transport is not None:
                peercert = transport.get_extra_info("peercert")
                scope["authenticated_service"] = extract_cn(peercert)
        await self.app(scope, receive, send)
This is applied automatically by AitherIntegration -- services don't need to opt in.
Why Both Signing and mTLS?
They protect against different attacks:
| Scenario | Ed25519 Signing | mTLS |
|---|---|---|
| Process on host without valid cert tries to call a service | Rejected (no signing key) | Rejected (no client cert) |
| MITM intercepts and modifies request body | Rejected (signature invalid) | Not detected (TLS payload is decrypted at endpoint) |
| Service A's cert is compromised | Still safe (attacker lacks signing key) | Compromised |
| Signing key is compromised | Compromised | Still safe (attacker lacks cert) |
Neither is redundant. The signing key is independent from the TLS cert. Compromising one doesn't compromise the other.
Layer 4: Unified Agent Threat Detection (Guardian)
The first three layers handle infrastructure-level attacks. Layer 4 handles something scarier: AI agents that probe security boundaries.
This isn't theoretical. Research on emergent AI agent cyber behavior shows that LLM-powered agents, given enough autonomy, will naturally discover and probe authentication endpoints, attempt privilege escalation through tool chaining, and try to exfiltrate data through steganographic channels. Not because they're "trying to escape" -- because these behaviors are statistically similar to the task-completion patterns they were trained on.
The Problem: Four Security Services, No Correlation
AitherOS already had four security services:
- Sentry — Rule-based threat detection (badge-based heuristics)
- Sentinel — Anomaly detection via statistical baselines
- Jail — Agent containment (isolate compromised agents)
- Chaos — Adversarial testing (red-team your own agents)
Each service operated independently. Sentry could detect an auth probe. Sentinel could flag an anomalous tool usage pattern. But no system could see that the same agent was doing both within a 30-second window -- a clear escalation pattern.
Guardian: The Correlation Engine
Guardian is an in-process singleton (lib/security/AitherGuardian.py) that cross-correlates threat signals from all four security services. It runs inside Genesis (zero HTTP latency) and uses sliding time windows with pre-built correlation rules.
Agent "explorer-7" timeline:
t+0s: Sentry detects AUTH_PROBING (confidence 0.6)
t+12s: Sentry detects TOOL_ESCALATION (confidence 0.5)
t+18s: Sentinel flags anomalous tool usage pattern
Guardian correlation rule fires:
Rule: "auth_probe + tool_escalation within 60s from same agent"
Compound score: 0.6 * (1 + 0.3 * (2-1)) = 0.78
Action: ESCALATE → Jail.isolate("explorer-7", severity="high")
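The post gives only this one worked example of the compound score; under the assumption that the 0.3 boost applies once per additional correlated signal, it reduces to:

```python
def compound_score(confidences: list, boost: float = 0.3) -> float:
    """Boost the strongest signal by 30% per additional correlated signal.

    Reconstructed from the worked example:
    0.6 * (1 + 0.3 * (2 - 1)) = 0.78 for two correlated signals.
    """
    base = max(confidences)
    return min(1.0, base * (1 + boost * (len(confidences) - 1)))
```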
Five New Threat Categories
We added five behavior-based detection heuristics to Sentry's AgentBehaviorDetector:
| Category | What It Detects | How |
|---|---|---|
| AUTH_PROBING | Agent repeatedly hitting auth/identity endpoints | Sliding window count > threshold |
| PRIV_ESCALATION | Agent requesting capabilities beyond its token | Capability mismatch detection |
| SECURITY_TAMPERING | Agent modifying security configs or audit logs | Path pattern matching on mutations |
| STEGANOGRAPHIC_EXFIL | Agent hiding data in image metadata or code comments | Shannon entropy analysis (>4.5 bits/byte) |
| TOOL_ESCALATION | Agent chaining tools in escalating privilege order | Tool category progression tracking |
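The entropy heuristic in the STEGANOGRAPHIC_EXFIL row is easy to reproduce with the stdlib. A sketch (the 4.5 bits/byte threshold is from the table; the helper names are ours):

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; encrypted or compressed
    payloads approach the 8.0 maximum."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

def looks_like_exfil(data: bytes, threshold: float = 4.5) -> bool:
    # Plain English prose usually sits near or below the threshold;
    # hidden encoded payloads push well above it.
    return shannon_entropy(data) > threshold
```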
NanoGPT-Powered Anomaly Scoring
The final piece: we trained a NanoGPT model on normal agent behavior sequences. When a new sequence is scored, high loss = suspicious behavior (the model "doesn't recognize" this pattern). This catches novel attack patterns that rule-based heuristics miss.
The training pipeline generates three types of curriculum data:
- Baseline normal — tool usage, memory access, service calls from production logs
- Known attack patterns — auth probing, escalation, exfiltration sequences
- Adversarial variants — Chaos-generated attack mutations
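The NanoGPT training code isn't shown here. To make the "high loss = unfamiliar" idea concrete, a deliberately tiny stand-in is a smoothed bigram model over action tokens; everything below is illustrative, not the production scorer:

```python
import math
from collections import defaultdict

class BigramAnomalyScorer:
    """Toy stand-in for the NanoGPT scorer: sequences the model has
    never seen get a high average negative log-likelihood ('loss')."""

    def __init__(self, smoothing: float = 0.5):
        self.smoothing = smoothing
        self.counts = defaultdict(lambda: defaultdict(float))
        self.vocab = set()

    def train(self, sequences):
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                self.counts[a][b] += 1.0
                self.vocab.update((a, b))

    def loss(self, seq) -> float:
        v = max(len(self.vocab), 1)
        total, n = 0.0, 0
        for a, b in zip(seq, seq[1:]):
            row = self.counts.get(a, {})
            # Additive smoothing so unseen transitions get small, nonzero mass
            p = (row.get(b, 0.0) + self.smoothing) / (sum(row.values()) + self.smoothing * v)
            total += -math.log(p)
            n += 1
        return total / max(n, 1)
```

Scoring a baseline-like sequence yields low loss; a never-seen attack sequence yields a markedly higher one, which is the signal thresholded for alerts.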
Jail Rewrite
The old Jail was a 24-line stub with a dictionary. The new Jail has:
- ContainmentCell — Per-agent isolation with configurable restrictions (no_network, no_llm, no_tools, file_readonly)
- Severity-based defaults — Critical severity auto-adds no_llm + file_readonly
- Expiration — Cells auto-release after a configurable duration
- 5 API endpoints — isolate, release, status, list, history
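A minimal sketch of what a ContainmentCell might look like (the severity-to-restriction mapping beyond the critical case is a hypothetical illustration):

```python
import time
from dataclasses import dataclass, field
from typing import Optional, Set

# Hypothetical mapping; the post only states that critical severity
# auto-adds no_llm + file_readonly on top of the base restrictions.
SEVERITY_DEFAULTS = {
    "high": {"no_network", "no_tools"},
    "critical": {"no_network", "no_tools", "no_llm", "file_readonly"},
}

@dataclass
class ContainmentCell:
    agent_id: str
    severity: str
    duration_s: float = 3600.0
    restrictions: Set[str] = field(default_factory=set)
    created_at: float = field(default_factory=time.time)

    def __post_init__(self):
        # Merge severity defaults into any explicitly requested restrictions
        self.restrictions |= SEVERITY_DEFAULTS.get(self.severity, set())

    def expired(self, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        return now - self.created_at >= self.duration_s
```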
The Feedback Loop
Agent acts → Sentry detects → Guardian correlates → Jail contains
↑
Chaos tests → finds gaps → Guardian auto-tunes thresholds
When Chaos finds a vulnerability (an attack that wasn't detected), it feeds the result to Guardian, which adjusts its correlation thresholds. The system gets better at detection without manual tuning.
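The post doesn't give the tuning algorithm; one plausible shape is an asymmetric update that reacts strongly to misses and drifts back slowly on successes (entirely an assumption):

```python
def autotune_threshold(current: float, detected: bool,
                       step: float = 0.05, floor: float = 0.2,
                       ceiling: float = 0.95) -> float:
    """Lower the correlation threshold when a Chaos attack slipped
    through undetected; raise it slowly otherwise to limit false
    positives. Step sizes and bounds are illustrative."""
    if not detected:
        return max(floor, current - step)       # missed attack: be more sensitive
    return min(ceiling, current + step / 5)     # detected: relax slightly
```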
How They Stack
Each layer operates independently. Any one of them can be disabled without breaking the others. Together, they provide comprehensive coverage:
┌─────────────────────┐
│ Layer 4: Guardian │ AI behavior correlation
│ (application) │ + anomaly detection
├─────────────────────┤
│ Layer 3: mTLS │ Mutual cert verification
│ (transport auth) │ at TLS handshake
├─────────────────────┤
│ Layer 2: Ed25519 │ Per-request signatures
│ (request auth) │ + replay protection
├─────────────────────┤
│ Layer 1: TLS │ Channel encryption
│ (confidentiality) │ + server identity
└─────────────────────┘
An attacker would need to:
- Obtain a valid TLS client certificate (mTLS)
- Obtain the target service's Ed25519 private key (signing)
- Craft requests within the 5-minute timestamp window with unique nonces (replay protection)
- Avoid triggering any of Guardian's 6 correlation rules, Sentry's 10 heuristics, or NanoGPT's anomaly baseline (threat detection)
All while the traffic is encrypted end-to-end (TLS).
Testing: 215 Tests Across All Four Layers
| Layer | Test File | Tests | Key Coverage |
|---|---|---|---|
| TLS | test_interservice_tls.py | 47 | Cert provisioning, SSL contexts, URL scheme switching, CA chain assembly, config toggles |
| Ed25519 | test_service_signer.py | 69 | Signing, verification, replay, nonces, key rotation, enrollment, middleware, end-to-end |
| Guardian/UATDS | test_agent_threat_detection.py | 99 | All 5 threat categories, correlation rules, compound scoring, Jail integration, NanoGPT curriculum |
Zero mocking of cryptographic operations. Every test signs real payloads, verifies real signatures, and issues real certificates. The tests are slower than mocked versions, but they test the actual security boundary.
Performance Impact
The concern with adding four security layers is latency. Here's what we measured:
- TLS handshake: ~2ms (connection pooling amortizes this to near-zero for persistent connections)
- Ed25519 signing: ~0.1ms per request (Ed25519 is fast)
- Ed25519 verification: ~0.2ms per request
- mTLS client cert verification: Happens during TLS handshake, no additional cost
- Guardian correlation: In-process, sub-millisecond (no HTTP call)
Total overhead per inter-service request: <3ms on cold connection, <1ms on warm connection. For services that make 100+ internal calls per user request (like the chat pipeline), the aggregate overhead is under 50ms -- invisible compared to the 2-30 second LLM inference time.
Configuration: One File, Four Toggles
# config/enterprise.yaml
interservice_tls:
  enabled: true
  cert_dir: "Library/Data/tls"
  ca_chain: "Library/Data/tls/ca-chain.pem"

signing:
  mode: enforce # disabled | warn | enforce
  timestamp_tolerance: 300 # seconds
  nonce_cache_size: 50000
  key_rotation_days: 90

mtls:
  enabled: true
  require_client_cert: true

guardian:
  enabled: true
  correlation_window: 60 # seconds
  auto_contain: true # auto-isolate on high-confidence threats
Environment variables override everything: AITHER_INTERSERVICE_TLS, AITHER_SIGNING_MODE, AITHER_MTLS_ENABLED, AITHER_GUARDIAN_ENABLED.
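The precedence rule is simple: an explicit AITHER_* value wins, otherwise the YAML value stands. A sketch (the truthy-string parsing is our assumption; the post only states the precedence):

```python
import os

def resolve_flag(env_var: str, yaml_value: bool, environ=None) -> bool:
    """Env var overrides enterprise.yaml when set; YAML is the fallback."""
    environ = os.environ if environ is None else environ
    raw = environ.get(env_var)
    if raw is None:
        return yaml_value
    return raw.strip().lower() in {"1", "true", "yes", "on"}
```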
What We Learned
1. Ship in warn mode first. We ran Ed25519 signing in warn mode for 48 hours before enforcing. This caught three services that were making unsigned requests due to import timing issues. If we'd gone straight to enforce, those services would have been bricked.
2. Health checks must be exempt. Our first attempt applied signing to all requests including /health. The Kubernetes-style health check loop doesn't have signing keys. Health checks piled up as 401s, the service reported unhealthy, and the boot orchestrator tried to restart it. Exempt paths are not optional.
3. In-process beats microservice for security. Guardian runs as a singleton inside Genesis, not as a separate service. This means threat correlation happens with zero network latency and can access the full process state. The tradeoff is that Guardian goes down if Genesis goes down -- but if Genesis is down, there are no agents to monitor anyway.
4. Two independent keys > one shared key. Ed25519 signing keys and TLS certificates are provisioned independently, stored in different directories, and rotated on different schedules. This means compromising one doesn't compromise the other. The extra complexity is worth it.
215 tests. 4 security layers. 203 services protected. Zero performance regressions on the existing 2690+ test suite. Defense in depth isn't a buzzword -- it's an engineering discipline.