engineering · security · architecture · networking

AitherFirewall: How We Built a Host-Based Stateful Firewall for an Agent OS

March 10, 2026 · 12 min read · Aitherium

AitherOS runs dozens of microservices across 12 architectural layers. Agents spawn subagents. Services talk to services. LLM calls flow through the model scheduler. And the entire mesh sits behind a single Cloudflare tunnel or a local Tailscale network. The question we kept dodging was simple: who decides which IPs can talk to which ports?

We had AitherBastion doing Layer 7 reverse proxying. We had AitherSentry doing real-time threat detection. We had AitherGuard running red team exercises. What we didn't have was a proper Layer 3/4 firewall — the foundational piece that says "this IP, in this zone, gets to reach this service, at this rate, period."

This is the engineering story of AitherFirewall.

The Gap in the Stack

Our security stack is deep. Layer 8 in the AitherOS architecture has 7 services that each handle a different facet of defense. But if you traced a request from the internet to, say, the secrets vault, the path looked like this:

Internet → Cloudflare → Bastion (reverse proxy) → Secrets

Bastion does TLS termination, path routing, and basic WAF. But Bastion is a Layer 7 proxy — it understands HTTP requests, not TCP connections. If someone port-scanned your host across the service port range, Sentry would detect the scan and create a block rule in AitherNet. But the connections themselves still landed on the ports. There was no enforcement point that said "deny all traffic from 203.0.113.0/24 to anything except port 443" before the request even reached a service.

Docker's published port mappings are just NAT rules. iptables on the host is an option, but it's not service-aware, it doesn't know about AitherOS zones, and it can't react in real time to Sentry threat events.

We needed a firewall that speaks the language of the AitherOS mesh.

Zone-Based Architecture

Every IP that touches AitherOS belongs to exactly one zone. Zone classification happens by matching the IP against CIDR ranges, evaluated top-to-bottom:

Zone        CIDR Ranges                                   What Lives Here
loopback    127.0.0.0/8, ::1/128                          The host itself
docker      172.17.0.0/16, 172.18.0.0/16, 172.19.0.0/16   Docker bridge networks
internal    10.0.0.0/8, 192.168.0.0/16, 172.16.0.0/12     LAN, Tailscale peers
management  (your admin IPs)                              SSH, monitoring
cloudflare  173.245.48.0/20, 103.21.244.0/22, ...         Cloudflare edge IPs
external    0.0.0.0/0                                     Everything else

Zone classification is the first thing that happens when a packet arrives. It costs about 2-5 microseconds — we're matching against ipaddress.ip_network objects, not regex.
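The classification step can be sketched in a few lines. This is a minimal illustration, not the actual AitherFirewall code: the zone table mirrors the table above (with the management zone omitted, since its CIDRs are deployment-specific), and the function name is made up for the example.

```python
import ipaddress

# Ordered zone table, evaluated top-to-bottom; first match wins.
# CIDRs mirror the table above; the Cloudflare list is truncated.
ZONES = [
    ("loopback",   ["127.0.0.0/8", "::1/128"]),
    ("docker",     ["172.17.0.0/16", "172.18.0.0/16", "172.19.0.0/16"]),
    ("internal",   ["10.0.0.0/8", "192.168.0.0/16", "172.16.0.0/12"]),
    ("cloudflare", ["173.245.48.0/20", "103.21.244.0/22"]),
    ("external",   ["0.0.0.0/0", "::/0"]),
]

# Parse the CIDRs once at load time, not per packet.
_PARSED = [(zone, [ipaddress.ip_network(c) for c in cidrs])
           for zone, cidrs in ZONES]

def classify(ip: str) -> str:
    """Map an IP to its zone via top-to-bottom CIDR matching."""
    addr = ipaddress.ip_address(ip)
    for zone, networks in _PARSED:
        # Membership tests across IPv4/IPv6 simply return False,
        # so mixed-version tables are safe.
        if any(addr in net for net in networks):
            return zone
    return "external"
```

Note the ordering does real work: the Docker bridge ranges are subsets of 172.16.0.0/12, so `docker` must be checked before `internal`.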

Why zones? Because the rules become readable:

rules:
  - name: docker_mesh_allow
    zone: docker
    action: allow
    description: "Services within Docker mesh can talk to each other"

  - name: external_deny_sensitive
    zone: external
    dest_services: [Secrets, SecurityCore, Strata]
    action: deny
    description: "Never expose secrets to the internet"

  - name: cloudflare_web_only
    zone: cloudflare
    dest_services: [Veil, Node, Bastion]
    action: allow
    description: "Cloudflare can reach web-facing services only"

Rules evaluate top-to-bottom, first match wins. If no rule matches, the default policy applies (deny by default).
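First-match-wins over rules like the YAML above reduces to a short loop. A sketch, with illustrative types (the real rule engine matches more fields — ports, CIDRs, paths, time windows):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Rule:
    name: str
    action: str                                   # "allow" | "deny"
    zone: Optional[str] = None                    # None matches any zone
    dest_services: List[str] = field(default_factory=list)  # empty matches any

def evaluate(rules: List[Rule], zone: str, dest_service: str,
             default_policy: str = "deny") -> Tuple[str, str]:
    """Walk the ordered rule list; the first rule whose set fields
    all match decides. If nothing matches, the default policy applies."""
    for rule in rules:
        if rule.zone is not None and rule.zone != zone:
            continue
        if rule.dest_services and dest_service not in rule.dest_services:
            continue
        return rule.action, rule.name
    return default_policy, "default_policy"
```

With the three rules from the YAML above, an external request to Secrets hits `external_deny_sensitive`, while an external request to Veil matches nothing and falls through to the deny-by-default policy.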

The Evaluation Pipeline

When AitherFirewall evaluates a connection attempt, it runs through six stages:

IP → Zone Classification → IP Reputation Check → GeoIP Check →
     Rule Matching → Rate Limit Check → Connection Tracking → Verdict

Stage 1: Zone Classification. Map IP to zone via CIDR matching. Loopback is always allowed immediately — we don't firewall the host from itself.

Stage 2: IP Reputation. Check if this IP has accumulated enough threat events to be auto-blocked. The reputation system uses configurable escalation: one threat bumps the score, N threats over a threshold trigger a temporary block, and repeated blocks escalate to permanent block. All driven by Flux events from Sentry and Sentinel.

Stage 3: GeoIP. Optional MaxMind GeoLite2 integration. If a country code is in the block list, the request is denied before rule evaluation even starts. Useful for networks that have no business receiving traffic from certain regions.

Stage 4: Rule Matching. Walk the ordered rule list. Match on zone, destination port, destination service name (resolved via the service registry), source CIDR, path prefix, protocol, and time window. First rule that matches determines the action.

Stage 5: Rate Limiting. If the matched rule has a rate_limit_rps setting, check the token bucket for this source IP. The token bucket refills at the configured rate. If the bucket is empty, the connection is rate-limited.

Stage 6: Connection Tracking. Record the connection in the state table. If this IP has exceeded the per-IP connection limit, or if SYN flood detection triggers (too many connections per second from one source), deny the connection.

The entire pipeline evaluates in under 100 microseconds for typical rule sets. We measured 50-80µs average in production with 12 rules.
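The Stage 5 token bucket is worth spelling out, since it's the one stateful piece of the rule path. A sketch under assumed names (the real firewall keys one bucket per source IP per rule):

```python
import time
from dataclasses import dataclass

@dataclass
class TokenBucket:
    """Per-source-IP bucket: refills continuously at rate_rps up to
    burst tokens; each connection attempt spends one token."""
    rate_rps: float
    burst: float

    def __post_init__(self):
        self.tokens = self.burst
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last_refill) * self.rate_rps)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # bucket empty: connection is rate-limited
```

The continuous-refill form avoids a background timer: the bucket only does arithmetic when a connection actually arrives, which is part of why the per-evaluation cost stays in microseconds.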

Auto-Block Escalation: Where Sentry Meets Firewall

The most interesting part of the firewall isn't the rule engine — it's the feedback loop with Sentry.

AitherSentry detects threats: port scans, injection attacks, brute force, anomalous connections. It emits SENTRY_PORT_SCAN, SENTRY_INJECTION, SENTRY_THREAT, and similar events through AitherFlux. AitherSentinel detects host-level threats: ransomware, rootkits, data exfiltration. It emits SENTINEL_ANOMALY_DETECTED events.

AitherFirewall subscribes to these Flux events. When a SENTRY_PORT_SCAN event arrives with a source IP, the firewall's IP reputation manager records a threat. If the same IP accumulates enough threat events (configurable threshold, default 5), the reputation manager auto-blocks it.

The escalation path looks like this:

1 threat event    → score +1, no action
5 threat events   → score reaches threshold → TEMP_BLOCK (15 min)
Block expires     → score decays over time
5 more events     → TEMP_BLOCK again
3 temp blocks     → PERMANENT_BLOCK

When a permanent block is created, the firewall emits its own FIREWALL_IP_BLOCKED event through Flux, which Sentry and the audit dashboard can consume. The whole loop is autonomous — no human in the middle unless you want one.

The Missing Integration: Bidirectional Communication

The initial implementation had the firewall subscribing to Sentry/Sentinel events via Flux. But the integration was one-directional. Sentry didn't know about the firewall's verdicts, and Sentinel didn't know when the firewall blocked an IP that was also triggering host-level alerts.

We closed this gap by adding direct API integration:

Sentry → Firewall. When Sentry detects a HIGH or CRITICAL severity threat and decides to block a source, it now POSTs to the firewall's /firewall/block endpoint in addition to pushing a DENY rule to AitherNet. This gives the firewall immediate enforcement instead of waiting for the Flux event to propagate.

Firewall → Sentry. When the firewall blocks an IP (from its own rule evaluation, not from a Sentry event), it reports the block back to Sentry via POST /sentry/threats. This feeds Sentry's correlation engine — if an IP is being blocked by the firewall for rate limiting, and Sentry independently sees injection attempts from the same IP, the correlation strengthens the case for escalation.

Sentinel → Firewall. When Sentinel detects a host-level threat originating from a specific IP (suspicious outbound connection, data exfiltration attempt), it now notifies the firewall to block that IP immediately. The host-level detection feeds back into network-level enforcement.

Identity → Firewall. Authenticated users with valid Identity tokens get tagged in the firewall's evaluation. An authenticated internal user hitting a service behind a restrictive rule can pass through if the rule allows authenticated: true. This connects Identity's RBAC system with the firewall's zone-based access control.

Hot-Reloadable Configuration

The firewall rules live in a dedicated YAML configuration file. The file is watched via SHA256 hash comparison every 30 seconds. When the hash changes, the rules are atomically replaced — no restart, no downtime.

settings:
  default_policy: deny
  log_denied: true
  log_allowed: false
  max_connections_per_ip: 500
  syn_flood_threshold: 50
  connection_idle_timeout: 300
  rule_reload_interval: 30

This is critical for an agent OS. You don't want to restart a firewall that's protecting dozens of containers just because you need to add a rule for a new service. And in emergency scenarios — active attack — an operator can add a block rule to the YAML and have it enforced within 30 seconds.
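The hash-compare watch is simple enough to show in full. A sketch with hypothetical names — the real loop runs every `rule_reload_interval` seconds and validates the YAML before swapping:

```python
import hashlib
from pathlib import Path

class RuleWatcher:
    """Detect config changes by SHA256, reload only on change."""

    def __init__(self, config_path: str):
        self.path = Path(config_path)
        self.last_hash = None
        self.rules = None

    def check_and_reload(self) -> bool:
        """Return True if the file changed and rules were swapped."""
        digest = hashlib.sha256(self.path.read_bytes()).hexdigest()
        if digest == self.last_hash:
            return False  # unchanged: skip the parse entirely
        new_rules = self.path.read_text()  # real code: parse + validate YAML
        # Swap only after the new config parses cleanly, so a broken
        # edit leaves the old rules enforcing.
        self.rules = new_rules
        self.last_hash = digest
        return True
```

Hashing the bytes (rather than trusting mtime) means edits via `rsync`, bind mounts, or editors that preserve timestamps are still picked up.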

Service-Name Resolution

One of the most useful features is service-name resolution in rules. Instead of writing raw port numbers, you can write:

dest_services: [Secrets, SecurityCore, SecurityDefense, Strata]

The firewall resolves service names to ports using the same service registry that every other AitherOS service uses. If a port changes in the registry, the firewall picks it up on the next reload. No rule updates needed.
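Resolution itself is just a lookup at rule-load time. A sketch — the registry contents and port numbers here are entirely hypothetical, standing in for the shared AitherOS service registry:

```python
# Hypothetical registry snapshot: service name -> port.
REGISTRY = {"Secrets": 8200, "SecurityCore": 8310, "Strata": 8420}

def resolve_dest_ports(dest_services: list) -> set:
    """Expand rule service names into destination ports. Unknown names
    are skipped, so a stale rule can't silently match the wrong port."""
    return {REGISTRY[name] for name in dest_services if name in REGISTRY}
```

Because resolution happens on every reload, a port change in the registry propagates to the firewall within one `rule_reload_interval` with no rule edits.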

Middleware Mode

Not every service wants to run a full firewall evaluation for every request. For lighter-weight integration, we provide firewall middleware — a FastAPI middleware that any AitherOS service can add with a single function call. The middleware extracts the client IP from request headers (respecting X-Forwarded-For, CF-Connecting-IP, etc.), calls the firewall's evaluation endpoint, and returns 403 if the verdict is denied. Internal IPs (Docker bridge, loopback) bypass the check entirely.

For services that can't reach the firewall (startup ordering, network issues), the middleware fails open — it logs a warning but doesn't block traffic. Security should never cause an outage worse than the threat it's preventing.
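The two decisions the middleware makes before it ever calls the firewall — whose IP is this, and should we even check it — can be sketched as plain functions (header precedence and names are illustrative):

```python
import ipaddress

# Proxy headers consulted in order; first present wins.
TRUSTED_HEADERS = ("CF-Connecting-IP", "X-Forwarded-For", "X-Real-IP")

def client_ip(headers: dict, peer_ip: str) -> str:
    """Pick the client IP from proxy headers, else the socket peer."""
    for header in TRUSTED_HEADERS:
        value = headers.get(header) or headers.get(header.lower())
        if value:
            # X-Forwarded-For may be a chain; left-most is the client.
            return value.split(",")[0].strip()
    return peer_ip

def should_check(ip: str) -> bool:
    """Internal IPs (loopback, Docker bridge) bypass the firewall call."""
    addr = ipaddress.ip_address(ip)
    return not (addr.is_loopback
                or addr in ipaddress.ip_network("172.17.0.0/16"))
```

The fail-open path then wraps the firewall call in a try/except: on timeout or connection error, log a warning and let the request through.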

Audit Trail

Every firewall decision is logged to a JSONL file in the Chronicle audit directory. The format follows OCSF (Open Cybersecurity Schema Framework) conventions:

{
  "timestamp": "2026-03-10T14:32:11Z",
  "verdict": "denied",
  "rule_name": "external_deny_sensitive",
  "zone": "external",
  "source_ip": "203.0.113.42",
  "dest_service": "Secrets",
  "reason": "external zone blocked from Secrets",
  "latency_us": 67.3
}

Files rotate at 50MB. The audit trail feeds into Strata's ingestion pipeline for long-term analysis and training data generation.

What We Measured

After deploying to staging (ring 1), we measured:

  • Evaluation latency: 50-80µs average, 200µs p99
  • Memory footprint: 40MB base, ~80MB with 10K active connections tracked
  • Rule reload time: <5ms for 12 rules
  • False positive rate: 0% in 72 hours of staging traffic (zone-based rules are deterministic)
  • Auto-blocks from Sentry: 3 in 72 hours (port scanners)

The latency number matters. At 50-80µs per evaluation, adding firewall checks to every service request costs less than a single DNS lookup. The bottleneck in the AitherOS request path is still the LLM inference call at 200-2000ms — the firewall is rounding error.

Architecture Diagram

The firewall sits at the entry point of the AitherOS security stack:

External Traffic
      │
      ▼
┌──────────────────┐
│ AitherFirewall   │ ← Zone classification, IP reputation, rate limiting
│                  │ ← Flux events from Sentry/Sentinel
└────────┬─────────┘
         │ ALLOW only
         ▼
┌──────────────────┐
│ AitherBastion    │ ← TLS termination, path routing, WAF
│                  │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ AitherGateway    │ ← API gateway, auth validation
│                  │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐    ┌──────────────────┐
│ Service Mesh     │◄──►│ AitherSentry     │ ← Threat detection
│ 60+ containers   │    │                  │    (feeds back to Firewall)
└──────────────────┘    └──────────────────┘

What's Next

The firewall is deployed and enforcing. The immediate next steps:

  1. GeoIP database integration — we ship with MaxMind GeoLite2 support but haven't enabled it in production yet.
  2. Kernel-level enforcement — the current firewall evaluates at the application layer via middleware. For true packet-level enforcement, we need eBPF programs or iptables rule generation from the firewall's YAML config.
  3. Dynamic rules from Guard — when AitherGuard's red team exercises find that a service is vulnerable to a specific attack vector, auto-generate a firewall rule to block that vector at the network level while the fix is deployed.
  4. Mesh-wide policy — currently the firewall runs on a single host. For multi-node AitherOS deployments, the firewall rules need to synchronize across nodes via AitherNet's policy engine.

All 53 tests pass.
