engineering · security · architecture · cryptography

Four Layers of Defense: How We Encrypted and Authenticated Every Byte Between 203 Microservices

March 15, 2026 · 14 min read · Aitherium

AitherOS runs 203 microservices on a single host. They communicate over HTTP on localhost. That's fast, simple, and completely insecure.

Any process on the host can sniff inter-service traffic. A compromised service can impersonate another. A rogue container can replay requests. And when your services include an LLM orchestrator that spawns AI agents with tool access, "insecure internal network" isn't a risk you defer to next quarter.

We shipped four independent security layers in one sprint: TLS transport encryption, Ed25519 request signing, mutual TLS authentication, and a NanoGPT-powered threat detection system. Each layer solves a different class of attack. Together, they make AitherOS's internal network as hardened as its external surface.

The Threat Model

Before writing code, we mapped the attack surface:

| Attack | Pre-mitigation | Which layer blocks it |
|---|---|---|
| Packet sniffing on host | Wide open | TLS |
| Service impersonation | Wide open | Ed25519 signing |
| Stolen credential replay | Time-limited | Nonce + timestamp |
| Unauthorized service-to-service calls | Wide open | mTLS |
| Compromised agent probing auth boundaries | Undetected | Guardian |
| Privilege escalation via tool chaining | Undetected | Guardian + Sentry |

No single layer handles everything. That's the point.

Layer 1: TLS Transport Encryption

Problem: All inter-service HTTP traffic was plaintext on localhost.

Solution: Every service now starts with TLS. get_service_url() returns https:// when TLS is enabled. Uvicorn binds with SSL context. Clients verify the CA chain.

Certificate Architecture

We run a two-tier PKI:

AitherNet Root CA (EC P-384, 10-year validity)
  └── AitherNet Intermediate CA (EC P-384, 5-year)
       ├── aither-genesis.crt (EC P-256, 1-year)
       ├── aither-node.crt
       ├── aither-secrets.crt
       └── ... (22 services pre-provisioned)

The root CA never signs service certs directly. The intermediate handles issuance and can be rotated without rebuilding the entire chain. The CA chain file (ca-chain.pem) is auto-assembled: intermediate + root, so clients can verify the full chain.
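The chain file is literal PEM concatenation, intermediate first so verifiers can walk up to the root. A minimal sketch (the function name and paths are illustrative, not the actual AitherSecrets code):

```python
from pathlib import Path

def assemble_ca_chain(intermediate: Path, root: Path, out: Path) -> None:
    """Write ca-chain.pem as intermediate + root so clients can verify the full chain."""
    out.write_text(intermediate.read_text() + root.read_text())
```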

Cert Provisioning (3-Tier Resolution)

When a service boots, TLSConfig.provision_service_cert() tries three sources in order:

  1. Already on disk: Library/Data/tls/{service}/cert.pem + key.pem
  2. Copy from AitherSecrets CA directory: Library/Data/secrets/ca/issued/{service}/
  3. Request from AitherSecrets API: POST /ca/issue/{service_name}

Most services hit tier 1 (instant). New services hit tier 3 once, then tier 1 forever.
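A sketch of the three-tier fallback, with hypothetical names standing in for the internals of TLSConfig.provision_service_cert():

```python
from pathlib import Path
from typing import Callable

def provision_cert(service: str, tls_dir: Path, ca_issued_dir: Path,
                   request_from_api: Callable[[str], bytes]) -> Path:
    """Resolve a service cert: local disk, then CA issued/ copy, then API issuance."""
    local = tls_dir / service / "cert.pem"
    if local.exists():                                # Tier 1: cached on disk
        return local
    local.parent.mkdir(parents=True, exist_ok=True)
    issued = ca_issued_dir / service / "cert.pem"
    if issued.exists():                               # Tier 2: copy from the CA directory
        local.write_bytes(issued.read_bytes())
    else:                                             # Tier 3: ask the CA API once
        local.write_bytes(request_from_api(service))
    return local
```

Because tier 3 writes the result to the tier-1 location, a new service pays the API round-trip exactly once.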

The Config Toggle

# enterprise.yaml
interservice_tls:
  enabled: true  # default
  cert_dir: "Library/Data/tls"
  ca_chain: "Library/Data/tls/ca-chain.pem"

Env override: AITHER_INTERSERVICE_TLS=false disables it. This exists for local development and testing, not production. The toggle is checked in AitherPorts.get_service_url(), AitherIntegration.__init__() (uvicorn SSL), and BaseServiceClient.__init__() (CA verification).

Layer 2: Ed25519 Request Signing

TLS encrypts the channel, but it doesn't prove who sent the request. Any service with network access can reach any other service's HTTPS endpoint. We need per-request authentication.

Solution: Every outbound HTTP request is signed with the sending service's Ed25519 private key. Every inbound request is verified against the sender's public key.

The Signing Scheme

Payload = METHOD | path | timestamp | nonce | SHA256(body)
Signature = Ed25519.sign(payload, private_key)

Five headers carry the signature:

X-Aither-Service: AitherGenesis
X-Aither-Timestamp: 1741852800
X-Aither-Nonce: a7f3b2c4d5e6
X-Aither-Key-Id: genesis-2026-03
X-Aither-Signature: base64(ed25519_signature)
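A sketch of the signing side using the PyCA cryptography library. The post names the payload fields but not their encoding, so the pipe-joined string and hex body digest here are assumptions:

```python
import base64
import hashlib
import os
import time

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def sign_request(key: Ed25519PrivateKey, service: str, key_id: str,
                 method: str, path: str, body: bytes) -> dict:
    """Build the five X-Aither-* headers for an outbound request."""
    ts = str(int(time.time()))
    nonce = os.urandom(6).hex()
    # Payload = METHOD | path | timestamp | nonce | SHA256(body)
    payload = "|".join([method, path, ts, nonce, hashlib.sha256(body).hexdigest()])
    sig = key.sign(payload.encode())
    return {
        "X-Aither-Service": service,
        "X-Aither-Timestamp": ts,
        "X-Aither-Nonce": nonce,
        "X-Aither-Key-Id": key_id,
        "X-Aither-Signature": base64.b64encode(sig).decode(),
    }
```

The receiver rebuilds the same payload string from the request and headers, then verifies with the sender's public key.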

Replay Protection

Two mechanisms prevent replay attacks:

  1. Timestamp tolerance: Requests older than 5 minutes are rejected. This handles clock drift while blocking stored-and-replayed requests.
  2. Nonce tracking: A bounded LRU cache (50,000 entries) tracks recently seen nonces. Duplicate nonces within the timestamp window are rejected.
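Both mechanisms fit in a small guard class. This is an illustrative sketch of the scheme above, not the shipped verifier:

```python
import time
from collections import OrderedDict

class ReplayGuard:
    """Reject stale timestamps and repeated nonces with a bounded LRU cache."""

    def __init__(self, tolerance_s: int = 300, max_nonces: int = 50_000):
        self.tolerance_s = tolerance_s
        self.max_nonces = max_nonces
        self._seen = OrderedDict()  # nonce -> first-seen time

    def check(self, timestamp: int, nonce: str, now: float = None) -> bool:
        now = time.time() if now is None else now
        if abs(now - timestamp) > self.tolerance_s:  # outside the 5-minute window
            return False
        if nonce in self._seen:                      # replayed nonce
            return False
        self._seen[nonce] = now
        if len(self._seen) > self.max_nonces:        # evict the oldest entry
            self._seen.popitem(last=False)
        return True
```

Bounding the cache matters: the timestamp window already rejects anything older than the tolerance, so evicted nonces are ones that could no longer be replayed anyway.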

Enforcement Modes

# enterprise.yaml
signing:
  mode: enforce  # disabled | warn | enforce

  • disabled: No signing, no verification. For development.
  • warn: Sign outbound, verify inbound, log failures but accept the request. For rollout.
  • enforce: Reject unsigned or invalid requests with 401.

We ran in warn mode for 48 hours to catch any service that wasn't signing correctly, then flipped to enforce.

Key Provisioning

Services get Ed25519 keypairs from AitherSecrets at boot -- the infrastructure already existed. Keys are cached locally, Fernet-encrypted with the machine key. Auto-rotation is configurable (default 90 days) with a 24-hour overlap period where both old and new keys are valid.

Exempt Paths

Health checks, metrics, docs, and favicon are exempt. These are read-only, unauthenticated by design, and high-frequency. Signing them would add latency to the hot path with no security benefit.
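The exemption check itself is a one-liner in the signing middleware. The exact path list here is an assumption beyond the four categories named above:

```python
# Illustrative exempt list; the post names health checks, metrics, docs, and favicon.
EXEMPT_PREFIXES = ("/health", "/metrics", "/docs", "/favicon.ico")

def requires_signature(path: str) -> bool:
    """Signing is skipped for read-only, unauthenticated hot paths."""
    return not path.startswith(EXEMPT_PREFIXES)
```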

Layer 3: Mutual TLS (mTLS)

TLS proves the server is who it claims to be. Signing proves the request is authentic. But mTLS goes further: the server verifies the client's certificate at the TLS handshake level, before any HTTP processing occurs.

How It Works

  1. Service A opens a TLS connection to Service B
  2. Service B's SSL context requires client certificates (ssl.CERT_REQUIRED)
  3. Service A presents its cert (same cert used for its own TLS server)
  4. Service B verifies the cert against the CA chain
  5. MTLSEnforcementMiddleware extracts the Common Name (CN) from the client cert
  6. The CN becomes the authenticated service identity for RBAC checks

The Middleware

class MTLSEnforcementMiddleware:
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http":
            # Extract the client cert CN from the TLS transport
            transport = scope.get("transport")
            if transport is not None:
                peercert = transport.get_extra_info("peercert")
                scope["authenticated_service"] = extract_cn(peercert)
        await self.app(scope, receive, send)

This is applied automatically by AitherIntegration -- services don't need to opt in.
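The extract_cn helper isn't shown in the post. With Python's ssl module, the peercert returned by get_extra_info("peercert") is a dict whose subject is a tuple of RDN tuples, so a plausible sketch is:

```python
def extract_cn(peercert) -> str:
    """Pull commonName from an ssl-module peercert dict:
    {'subject': ((('commonName', 'aither-node'),), ...), ...}"""
    if not peercert:
        return None
    for rdn in peercert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None
```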

Why Both Signing and mTLS?

They protect against different attacks:

| Scenario | Ed25519 Signing | mTLS |
|---|---|---|
| Process on host without valid cert tries to call a service | Rejected (no signing key) | Rejected (no client cert) |
| MITM intercepts and modifies request body | Rejected (signature invalid) | Not detected (TLS payload is decrypted at endpoint) |
| Service A's cert is compromised | Still safe (attacker lacks signing key) | Compromised |
| Signing key is compromised | Compromised | Still safe (attacker lacks cert) |

Neither is redundant. The signing key is independent from the TLS cert. Compromising one doesn't compromise the other.

Layer 4: Unified Agent Threat Detection (Guardian)

The first three layers handle infrastructure-level attacks. Layer 4 handles something scarier: AI agents that probe security boundaries.

This isn't theoretical. Research on emergent AI agent cyber behavior shows that LLM-powered agents, given enough autonomy, will naturally discover and probe authentication endpoints, attempt privilege escalation through tool chaining, and try to exfiltrate data through steganographic channels. Not because they're "trying to escape" -- because these behaviors are statistically similar to the task-completion patterns they were trained on.

The Problem: Four Security Services, No Correlation

AitherOS already had four security services:

  • Sentry — Rule-based threat detection (badge-based heuristics)
  • Sentinel — Anomaly detection via statistical baselines
  • Jail — Agent containment (isolate compromised agents)
  • Chaos — Adversarial testing (red-team your own agents)

Each service operated independently. Sentry could detect an auth probe. Sentinel could flag an anomalous tool usage pattern. But no system could see that the same agent was doing both within a 30-second window -- a clear escalation pattern.

Guardian: The Correlation Engine

Guardian is an in-process singleton (lib/security/AitherGuardian.py) that cross-correlates threat signals from all four security services. It runs inside Genesis (zero HTTP latency) and uses sliding time windows with pre-built correlation rules.

Agent "explorer-7" timeline:
  t+0s:  Sentry detects AUTH_PROBING (confidence 0.6)
  t+12s: Sentry detects TOOL_ESCALATION (confidence 0.5)
  t+18s: Sentinel flags anomalous tool usage pattern

Guardian correlation rule fires:
  Rule: "auth_probe + tool_escalation within 60s from same agent"
  Compound score: 0.6 * (1 + 0.3 * (2-1)) = 0.78
  Action: ESCALATE → Jail.isolate("explorer-7", severity="high")
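The compound score in the timeline generalizes naturally: take the strongest signal and boost it for each additional correlated signal. A sketch that reproduces the 0.78 above (the real Guardian formula may differ):

```python
def compound_score(confidences: list, boost: float = 0.3) -> float:
    """Strongest signal, boosted per additional correlated signal, capped at 1.0.
    E.g. [0.6, 0.5] -> 0.6 * (1 + 0.3 * (2 - 1)) = 0.78."""
    if not confidences:
        return 0.0
    base = max(confidences)
    return min(1.0, base * (1 + boost * (len(confidences) - 1)))
```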

Five New Threat Categories

We added five behavior-based detection heuristics to Sentry's AgentBehaviorDetector:

| Category | What It Detects | How |
|---|---|---|
| AUTH_PROBING | Agent repeatedly hitting auth/identity endpoints | Sliding window count > threshold |
| PRIV_ESCALATION | Agent requesting capabilities beyond its token | Capability mismatch detection |
| SECURITY_TAMPERING | Agent modifying security configs or audit logs | Path pattern matching on mutations |
| STEGANOGRAPHIC_EXFIL | Agent hiding data in image metadata or code comments | Shannon entropy analysis (>4.5 bits/byte) |
| TOOL_ESCALATION | Agent chaining tools in escalating privilege order | Tool category progression tracking |
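The STEGANOGRAPHIC_EXFIL heuristic is standard Shannon entropy over bytes: natural-language text sits well below the 4.5 bits/byte threshold, while compressed or encrypted payloads approach 8. A sketch:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```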

NanoGPT-Powered Anomaly Scoring

The final piece: we trained a NanoGPT model on normal agent behavior sequences. When a new sequence is scored, high loss = suspicious behavior (the model "doesn't recognize" this pattern). This catches novel attack patterns that rule-based heuristics miss.

The training pipeline generates three types of curriculum data:

  1. Baseline normal — tool usage, memory access, service calls from production logs
  2. Known attack patterns — auth probing, escalation, exfiltration sequences
  3. Adversarial variants — Chaos-generated attack mutations

Jail Rewrite

The old Jail was a 24-line stub with a dictionary. The new Jail has:

  • ContainmentCell — Per-agent isolation with configurable restrictions (no_network, no_llm, no_tools, file_readonly)
  • Severity-based defaults — Critical severity auto-adds no_llm + file_readonly
  • Expiration — Cells auto-release after configurable duration
  • 5 API endpoints — isolate, release, status, list, history

The Feedback Loop

Agent acts → Sentry detects → Guardian correlates → Jail contains
                                    ↑
Chaos tests → finds gaps → Guardian auto-tunes thresholds

When Chaos finds a vulnerability (an attack that wasn't detected), it feeds the result to Guardian, which adjusts its correlation thresholds. The system gets better at detection without manual tuning.
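One plausible shape for that tuning step, entirely illustrative since the post doesn't show Guardian's logic: lower a rule's firing threshold when a Chaos-injected attack slips through, and nudge it back up slowly on successful detections so thresholds don't ratchet down forever.

```python
def tune_threshold(current: float, detected: bool, step: float = 0.02,
                   floor: float = 0.3, ceil: float = 0.95) -> float:
    """Hypothetical auto-tune: miss -> lower threshold; hit -> small recovery."""
    new = current + step / 4 if detected else current - step
    return min(ceil, max(floor, new))  # clamp to sane bounds
```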

How They Stack

Each layer operates independently. Any one of them can be disabled without breaking the others. Together, they provide comprehensive coverage:

                     ┌─────────────────────┐
                     │   Layer 4: Guardian  │  AI behavior correlation
                     │   (application)      │  + anomaly detection
                     ├─────────────────────┤
                     │   Layer 3: mTLS      │  Mutual cert verification
                     │   (transport auth)   │  at TLS handshake
                     ├─────────────────────┤
                     │   Layer 2: Ed25519   │  Per-request signatures
                     │   (request auth)     │  + replay protection
                     ├─────────────────────┤
                     │   Layer 1: TLS       │  Channel encryption
                     │   (confidentiality)  │  + server identity
                     └─────────────────────┘

An attacker would need to:

  1. Obtain a valid TLS client certificate (mTLS)
  2. Obtain the target service's Ed25519 private key (signing)
  3. Craft requests within the 5-minute timestamp window with unique nonces (replay protection)
  4. Avoid triggering any of Guardian's 6 correlation rules, Sentry's 10 heuristics, or NanoGPT's anomaly baseline (threat detection)

All while the traffic is encrypted end-to-end (TLS).

Testing: 215 Tests Across All Four Layers

| Layer | Test File | Tests | Key Coverage |
|---|---|---|---|
| TLS | test_interservice_tls.py | 47 | Cert provisioning, SSL contexts, URL scheme switching, CA chain assembly, config toggles |
| Ed25519 | test_service_signer.py | 69 | Signing, verification, replay, nonces, key rotation, enrollment, middleware, end-to-end |
| Guardian/UATD | test_agent_threat_detection.py | 99 | All 5 threat categories, correlation rules, compound scoring, Jail integration, NanoGPT curriculum |

Zero mocking of cryptographic operations. Every test signs real payloads, verifies real signatures, and issues real certificates. The tests are slower than mocked versions, but they test the actual security boundary.

Performance Impact

The concern with adding four security layers is latency. Here's what we measured:

  • TLS handshake: ~2ms (connection pooling amortizes this to near-zero for persistent connections)
  • Ed25519 signing: ~0.1ms per request (Ed25519 is fast)
  • Ed25519 verification: ~0.2ms per request
  • mTLS client cert verification: Happens during TLS handshake, no additional cost
  • Guardian correlation: In-process, sub-millisecond (no HTTP call)

Total overhead per inter-service request: <3ms on cold connection, <1ms on warm connection. For services that make 100+ internal calls per user request (like the chat pipeline), the aggregate overhead is under 50ms -- invisible compared to the 2-30 second LLM inference time.

Configuration: One File, Four Toggles

# config/enterprise.yaml
interservice_tls:
  enabled: true
  cert_dir: "Library/Data/tls"
  ca_chain: "Library/Data/tls/ca-chain.pem"

signing:
  mode: enforce        # disabled | warn | enforce
  timestamp_tolerance: 300  # seconds
  nonce_cache_size: 50000
  key_rotation_days: 90

mtls:
  enabled: true
  require_client_cert: true

guardian:
  enabled: true
  correlation_window: 60   # seconds
  auto_contain: true       # auto-isolate on high-confidence threats

Environment variables override everything: AITHER_INTERSERVICE_TLS, AITHER_SIGNING_MODE, AITHER_MTLS_ENABLED, AITHER_GUARDIAN_ENABLED.
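A sketch of that precedence for one of the four toggles; the function name and the set of falsy strings are assumptions:

```python
import os

def tls_enabled(config: dict) -> bool:
    """Env var wins over enterprise.yaml; AITHER_INTERSERVICE_TLS=false disables TLS."""
    env = os.environ.get("AITHER_INTERSERVICE_TLS")
    if env is not None:
        return env.strip().lower() not in {"0", "false", "no"}
    return bool(config.get("interservice_tls", {}).get("enabled", True))
```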

What We Learned

1. Ship in warn mode first. We ran Ed25519 signing in warn mode for 48 hours before enforcing. This caught three services that were making unsigned requests due to import timing issues. If we'd gone straight to enforce, those services would have been bricked.

2. Health checks must be exempt. Our first attempt applied signing to all requests including /health. The Kubernetes-style health check loop doesn't have signing keys. Health checks piled up as 401s, the service reported unhealthy, and the boot orchestrator tried to restart it. Exempt paths are not optional.

3. In-process beats microservice for security. Guardian runs as a singleton inside Genesis, not as a separate service. This means threat correlation happens with zero network latency and can access the full process state. The tradeoff is that Guardian goes down if Genesis goes down -- but if Genesis is down, there are no agents to monitor anyway.

4. Two independent keys > one shared key. Ed25519 signing keys and TLS certificates are provisioned independently, stored in different directories, and rotated on different schedules. This means compromising one doesn't compromise the other. The extra complexity is worth it.


215 tests. 4 security layers. 203 services protected. Zero performance regressions on the existing 2690+ test suite. Defense in depth isn't a buzzword -- it's an engineering discipline.
