
Bulletproof: How AitherOS Treats Every Community App Like an Untrusted Binary

March 5, 2026 · 14 min read · Aitherium

When we shipped the AitherOS Community App Package Manager, the first question from the team wasn't "does it work?" — it was "what happens when someone installs something malicious?"

Fair question. The AitherOS app ecosystem is designed to pull from open-source collections like awesome-llm-apps by Shubham Saboo — a phenomenal curated collection of 100+ LLM apps with AI Agents, RAG, MCP, voice agents, and multi-agent teams, built with OpenAI, Anthropic, Google, and open-source models. With nearly 100k GitHub stars and hundreds of community-contributed apps spanning everything from AI travel agents to autonomous game-playing bots, it's exactly the kind of rich, diverse ecosystem we want AitherOS users to tap into. We owe a huge thank you to Shubham and every contributor to that repo — it's the foundation our app catalog is built on.

But that diversity is also the challenge. We're giving users the ability to install arbitrary Python code from GitHub, build it into Docker containers, and run it alongside the operating system's many internal services. That's a massive attack surface if you don't treat every installation like loading an unknown binary onto a production server.

So we built a security system that does exactly that. Every community app goes through a pipeline that would make a Linux kernel security module blush: behavioral sandbox probing, automated minimum-privilege policy generation, seccomp syscall filtering, iptables-style firewall rules, AppArmor-style filesystem ACLs, cgroup resource limits, policy versioning with rollback, full audit trails, and a caller isolation layer that puts a hard network and identity boundary between community apps and internal AitherOS services.

This is the full story of how it works.

The Threat Model

Community apps come from GitHub repos. They could be:

  • Benign — a Streamlit dashboard that reads an API and shows charts
  • Overprivileged — a FastAPI service that opens unnecessary network connections and reads environment variables it doesn't need
  • Malicious — something that tries to exfiltrate secrets, escalate privileges, or pivot into the internal service mesh

The goal is simple: apps should get exactly the permissions they need, nothing more, with every deviation recorded and enforced. The system should be able to go from "never seen this app before" to "here's a minimum-privilege policy" without human intervention for low-risk apps, while flagging anything suspicious for manual review.

Stage 1: The Onboarding Pipeline

When you trigger an app installation, the app doesn't just get cloned and pip-installed. It enters a six-stage onboarding pipeline:

SBOM Generation

The first thing we do is build a Software Bill of Materials. We parse requirements.txt line by line and scan Python imports across all source files. Every dependency is cross-referenced against a known-risk database — pyautogui, keyboard, pynput are high-risk (input capture); paramiko, fabric are medium (SSH access); requests, httpx are low (HTTP clients). The SBOM records every package, its version, its risk tier, and its source (requirements file vs. import scan).
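The requirements parsing and risk tiering described above can be sketched roughly like this. The risk tiers and package names come from the post; the function name, entry shape, and version-specifier handling are illustrative assumptions, not the production code:

```python
# Hypothetical sketch of the SBOM risk-tiering step: package names from a
# requirements file are matched against a known-risk database.
RISK_DB = {
    "pyautogui": "high", "keyboard": "high", "pynput": "high",
    "paramiko": "medium", "fabric": "medium",
    "requests": "low", "httpx": "low",
}

def build_sbom(requirements_text: str) -> list[dict]:
    """Parse requirements.txt lines into SBOM entries with risk tiers."""
    entries = []
    for line in requirements_text.splitlines():
        line = line.split("#")[0].strip()          # drop comments and whitespace
        if not line:
            continue
        # split "package==1.2.3" (or >=, ~=, etc.) into name and version
        for sep in ("==", ">=", "<=", "~=", ">", "<"):
            if sep in line:
                name, version = line.split(sep, 1)
                break
        else:
            name, version = line, "unpinned"
        entries.append({
            "package": name.strip().lower(),
            "version": version.strip(),
            "risk": RISK_DB.get(name.strip().lower(), "unknown"),
            "source": "requirements",
        })
    return entries

sbom = build_sbom("requests==2.31.0\nparamiko>=3.0\npyautogui\n")
```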

Static Risk Scan

Next, we scan every .py file for 35+ risky code patterns using regex:

| Severity | What We Flag |
| --- | --- |
| Critical | `eval()`, `exec()`, `os.setuid()`, `os.chroot()`, `os.mount()`, `os.mknod()` |
| High | `subprocess.*()`, `pickle.loads()`, `__import__()`, `os.system()`, `ctypes`, `paramiko`, `telnetlib` |
| Medium | `os.environ`, `socket.socket()`, `os.chmod()`, `shutil.rmtree()`, `smtplib`, `mmap` |
| Low | `requests.get()`, `webbrowser.open()`, `threading.*`, `multiprocessing.*` |

Each finding records the file, line number, pattern, and severity. The aggregate score is weighted — a single eval() hits harder than ten requests.get() calls.
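A minimal sketch of the weighted regex scan, recording file, line, pattern, and severity per finding. The patterns and weights here are illustrative stand-ins for the 35+ production rules:

```python
import re

# Illustrative subset of the pattern table; weights make a single critical
# finding outweigh many low-severity ones, as described in the post.
PATTERNS = [
    (r"\beval\s*\(", "critical", 10.0),
    (r"\bexec\s*\(", "critical", 10.0),
    (r"\bsubprocess\.", "high", 5.0),
    (r"\bos\.environ\b", "medium", 2.0),
    (r"\brequests\.get\s*\(", "low", 0.5),
]

def scan_source(path: str, source: str) -> tuple[list[dict], float]:
    findings, score = [], 0.0
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, severity, weight in PATTERNS:
            if re.search(pattern, line):
                findings.append({"file": path, "line": lineno,
                                 "pattern": pattern, "severity": severity})
                score += weight
    return findings, score

findings, score = scan_source("app.py", "import requests\nx = eval(data)\n")
```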

Sandbox Probe

This is where it gets serious. If the app has a Docker image, we launch it in a maximally locked-down container and watch everything it does for 30 seconds:

docker run -d \
  --network=none \
  --read-only \
  --tmpfs /tmp:rw,size=64m \
  --cap-drop=ALL \
  --security-opt=no-new-privileges \
  --pids-limit=64 \
  --memory=256m \
  --cpus=0.25 \
  $IMAGE

No network. Read-only filesystem. All Linux capabilities dropped. PID and memory capped. Then we collect:

  • Process tree via docker top — every PID, PPID, command, and user
  • Open file descriptors from /proc/1/fd/ — what files the process has open
  • Network connections from /proc/net/tcp, /proc/net/tcp6, /proc/net/udp — parsed from hex with IPv4 and IPv6 support
  • Filesystem changes via docker diff — every file added, changed, or deleted
  • Resource usage from docker stats — peak memory, CPU percentage, PID count, block I/O
  • Container inspection via docker inspect — exit code, OOM-killed status
  • Memory maps from /proc/1/maps — which shared libraries and executables are loaded
  • Environment variables from /proc/1/environ — which env vars the process can see
  • Container logs — stdout and stderr samples

The probe result is a comprehensive data structure with typed records for every observation: syscall traces, file access records, network connections, process records, capability attempts, and resource usage.
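One concrete detail worth showing is the /proc/net/tcp parsing mentioned above: the kernel stores IPv4 addresses as little-endian hex, so an entry like "0100007F:1F90" is actually 127.0.0.1:8080. The function name below is an assumption; the hex layout is the real /proc format:

```python
import socket
import struct

def parse_proc_tcp_entry(local_address: str) -> tuple[str, int]:
    """Decode a /proc/net/tcp address field like '0100007F:1F90'."""
    hex_ip, hex_port = local_address.split(":")
    # /proc stores IPv4 as a little-endian 32-bit hex value
    ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
    return ip, int(hex_port, 16)

ip, port = parse_proc_tcp_entry("0100007F:1F90")
```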

Sentinel Verdict

The probe telemetry is submitted to AitherSentinel — the five-stage threat detection pipeline (Bloom filter, feature extraction, anomaly scoring, behavioral baseline, quarantine). Sentinel classifies the app as benign, suspicious, or threat. If Sentinel isn't running, we fall back to local scoring.

Risk Aggregation

The overall risk score is a weighted blend:

| Source | Weight |
| --- | --- |
| Static analysis | 35% |
| SBOM risk | 15% |
| Probe behavior | 35% |
| CVE findings | 15% |

Apps scoring below 0.3 with a benign Sentinel verdict are auto-approved. Above 0.6 or with a threat verdict: auto-denied. Everything in between: queued for human review.
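The blend and thresholds above, as a sketch. The weights and cutoffs come from the post; the function signature and score keys are assumed:

```python
# Weighted risk blend with the auto-approve / auto-deny / review thresholds
# described above. Everything between the cutoffs goes to a human.
WEIGHTS = {"static": 0.35, "sbom": 0.15, "probe": 0.35, "cve": 0.15}

def aggregate_risk(scores: dict, verdict: str) -> str:
    overall = sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)
    if overall > 0.6 or verdict == "threat":
        return "auto-denied"
    if overall < 0.3 and verdict == "benign":
        return "auto-approved"
    return "human-review"

decision = aggregate_risk(
    {"static": 0.2, "sbom": 0.1, "probe": 0.1, "cve": 0.0}, "benign")
```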

Policy Recommendation

Here's the core of the "model Linux" approach. The recommendation engine analyzes the probe telemetry and generates a minimum-privilege policy — every permission is derived from what the app actually did during the probe, nothing more:

  • Filesystem ACLs — only paths the app touched get access, filtered for system noise (/proc, /sys, /usr/lib). Read-only for paths only read, read-write for paths written, execute for loaded binaries.
  • Firewall rules — iptables-style. Each observed egress connection becomes an OUTPUT ALLOW rule with specific destination and port. Each listening port becomes an INPUT ALLOW. DNS allowed only if the app makes network connections. Default: DROP on both chains.
  • Seccomp syscall filter — starts with a Python baseline of ~50 syscalls. Adds network syscalls only if the app needs networking. Adds any observed non-dangerous syscalls. Never adds ptrace, mount, chroot, reboot, init_module, or kexec_load.
  • Linux capabilities — default: none. Only adds capabilities the app demonstrably needs, and never adds CAP_SYS_ADMIN, CAP_SYS_PTRACE, or CAP_SYS_MODULE.
  • Cgroup limits — memory set to 1.5x observed peak (capped at 4096MB), CPU scaled from observed usage, PID limit at 2x observed peak.
  • Environment whitelist — safe defaults (PATH, HOME, LANG, TZ, PYTHONPATH) plus any the app actually read, minus anything starting with AITHER_.
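To make the firewall derivation concrete, here is a sketch of how observed probe traffic might become rules: each egress connection an OUTPUT ALLOW, each listening port an INPUT ALLOW, DNS only when the app networks at all, default DROP. The function and rule-string format are illustrative, not the real API:

```python
def recommend_firewall(egress: list, listening: list, dns_needed: bool) -> list:
    """Derive iptables-style rules from observed probe behavior only."""
    rules = []
    for host, port in egress:                      # observed outbound connections
        rules.append(f"OUTPUT ALLOW dst={host} dport={port}/tcp")
    for port in listening:                         # observed listening sockets
        rules.append(f"INPUT ALLOW dport={port}/tcp")
    if dns_needed:                                 # DNS only if the app networks
        rules.append("OUTPUT ALLOW dport=53/udp")
    rules += ["OUTPUT DROP default", "INPUT DROP default"]
    return rules

rules = recommend_firewall([("api.example.com", 443)], [8501], dns_needed=True)
```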

Stage 2: The Policy Model

The recommended policy feeds into a comprehensive data model that captures Linux OS-level security controls:

App Policy
├── Identity & Provenance (source_repo, source_hash, installed_version)
├── Lifecycle (status, reviewed_by, reviewed_at, expires_at, ttl_days)
├── Network (firewall_rules[], allowed_egress_hosts[], allowed_ports[], dns_allowed)
├── Filesystem (filesystem_rules[], read_only_root, tmpfs_size_mb, volume_mounts)
├── Seccomp (seccomp_syscalls[], seccomp_default_action)
├── Linux Capabilities (linux_capabilities[])
├── Cgroup Limits (max_memory_mb, max_cpus, max_pids, max_io_read/write_mbps, oom_score_adj)
├── Environment (allow_env_vars[], blocked_env_vars[], inject_env{})
├── IPC (ipc_mode, allow_shared_memory)
├── User Namespace (run_as_user, run_as_uid, run_as_gid)
├── Device Access (allowed_devices[])
├── Agent Ecosystem (a2a_enabled, tool_enabled, capabilities[])
├── Trust & Risk (risk_score, risk_flags[], trusted, trust_reason)
├── Counters (violation_count, launch_count, last_launched)
└── Versioning (policy_version, previous_snapshot{})

Every field has a purpose. When the app is launched, the policy is translated into Docker run arguments — typically 15–25 flags covering network, memory, CPU, PIDs, capabilities, OOM scoring, IO bandwidth, user namespaces, devices, and volumes. A seccomp profile generator produces a Docker-compatible JSON profile that gets written to disk and applied at container launch.
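A sketch of that policy-to-flags translation, with assumed field names (the real policy yields far more flags than this subset):

```python
def policy_to_docker_args(policy: dict) -> list:
    """Translate a subset of policy fields into docker run flags."""
    args = [
        f"--memory={policy['max_memory_mb']}m",
        f"--cpus={policy['max_cpus']}",
        f"--pids-limit={policy['max_pids']}",
        "--cap-drop=ALL",                          # default: no capabilities
        "--security-opt=no-new-privileges",
    ]
    for cap in policy.get("linux_capabilities", []):
        args.append(f"--cap-add={cap}")            # only demonstrably needed caps
    if policy.get("read_only_root"):
        args.append("--read-only")
    if policy.get("network_mode") == "none":
        args.append("--network=none")
    if policy.get("seccomp_profile_path"):
        args.append(f"--security-opt=seccomp={policy['seccomp_profile_path']}")
    return args

args = policy_to_docker_args({"max_memory_mb": 256, "max_cpus": 0.25,
                              "max_pids": 64, "read_only_root": True,
                              "network_mode": "none"})
```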

Policy Lifecycle

Policies follow a strict state machine:

pending_review → approved | denied
approved       → suspended | expired (TTL exceeded)
suspended      → approved (re-approve)
denied         → pending_review (re-submit)
expired        → pending_review (re-review)

Every transition is validated — you can't go from denied to expired or from expired to suspended. Invalid transitions are blocked with an error, not silently accepted.
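The state machine above reduces to a transition table; anything not in the table raises instead of being silently accepted. A sketch, not the shipped code:

```python
# Valid lifecycle transitions from the diagram above.
VALID_TRANSITIONS = {
    "pending_review": {"approved", "denied"},
    "approved": {"suspended", "expired"},
    "suspended": {"approved"},
    "denied": {"pending_review"},
    "expired": {"pending_review"},
}

def transition(current: str, target: str) -> str:
    if target not in VALID_TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid transition: {current} -> {target}")
    return target

state = transition("pending_review", "approved")
```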

Versioning and Rollback

Every mutation to a policy:

  1. Takes a snapshot of the current state
  2. Applies the change
  3. Validates the result (checks status, network mode, memory limits, CPU, PID limits, IPC mode, seccomp action)
  4. Bumps the policy_version counter
  5. Writes an audit entry with the actor, old status, new status, changes dict, and policy version

If validation fails after a modify, the policy is automatically rolled back to the snapshot. Admins can also manually trigger a rollback via the API. The diff between the current state and the previous snapshot is also available for inspection.

Atomic Persistence

Policy files are written atomically — temp file, write, os.replace(). A crash mid-save can't corrupt the policy store. Concurrent writes are serialized with a threading lock.
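The atomic write pattern above, sketched out: write to a temp file in the same directory, fsync, then os.replace(), which is atomic on POSIX, all under a lock. Function and path names are illustrative:

```python
import json
import os
import tempfile
import threading

_lock = threading.Lock()

def save_policy(path: str, policy: dict) -> None:
    """Atomically persist a policy: temp file, write, os.replace()."""
    with _lock:                                    # serialize concurrent writers
        dirname = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(policy, f)
                f.flush()
                os.fsync(f.fileno())               # force bytes to disk
            os.replace(tmp, path)                  # atomic: reader sees old or new, never half
        except BaseException:
            os.unlink(tmp)                         # clean up on any failure
            raise

path = os.path.join(tempfile.gettempdir(), "demo_policy.json")
save_policy(path, {"policy_version": 3})
```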

Stage 3: The Rules Engine

Eight built-in rules evaluate automatically, with side effects:

| Rule | Condition | Action |
| --- | --- | --- |
| `block_critical_static` | 3+ critical static findings | Auto-deny |
| `block_high_risk` | Overall risk >= 0.7 | Auto-deny |
| `flag_medium_risk` | Risk 0.4–0.7 | Flag for mandatory review |
| `suspend_on_probe_crash` | App crashed during probe | Auto-deny |
| `downgrade_network_if_unused` | Zero network connections in probe | Force network_access=none |
| `auto_approve_low_risk` | Risk < 0.15, no high/critical findings | Auto-approve |
| `expire_stale_reviews` | Days since review >= TTL | Auto-expire |
| `block_no_license` | No LICENSE file | Flag |

Rules are declarative — condition-action pairs with arbitrary key comparisons (_gte, _lt, _eq, _ne). Custom rules can be added at runtime via the API. The engine runs on a 30-minute background cycle across all approved apps.
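A sketch of that declarative condition matcher: keys ending in _gte, _lt, _eq, or _ne compare against fields of the app record, and bare keys do equality. The rule and field names below are assumed:

```python
import operator

# Suffix -> comparison, matching the declarative keys described above.
OPS = {"_gte": operator.ge, "_lt": operator.lt,
       "_eq": operator.eq, "_ne": operator.ne}

def matches(condition: dict, record: dict) -> bool:
    """Return True if every condition key holds against the record."""
    for key, expected in condition.items():
        for suffix, op in OPS.items():
            if key.endswith(suffix):
                field = key[: -len(suffix)]
                if not op(record.get(field), expected):
                    return False
                break
        else:                                      # no suffix: plain equality
            if record.get(key) != expected:
                return False
    return True

hit = matches({"risk_score_gte": 0.7}, {"risk_score": 0.85})
```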

Stage 4: Caller Isolation

This is the layer that prevents a community app from "calling home" to AitherOS internal services to approve itself, read secrets, or escalate privileges.

Six Caller Types

| Type | Authentication | Access Level |
| --- | --- | --- |
| SYSTEM | Cryptographically signed service header | FULL — unrestricted |
| AGENT | Agent identity header | AGENT_OPS — launch, stop, query |
| ADMIN | Bearer token + admin role | ADMIN — all CRUD + policy management |
| COMMUNITY_APP | Scoped cryptographic token | SELF_ONLY — own status, own policy |
| EXTERNAL_API | API key | READ_ONLY — catalog browsing |
| ANONYMOUS | None | NONE — health check only |

Scoped App Tokens

When an app is approved, it receives a cryptographically signed token scoped to its app identity. This token lets the app query its own status and policy — and nothing else. It can't read other apps' data. It can't invoke approval, denial, suspension, or any policy management operation. It can't access audit logs, violations across apps, or security previews.

When the app is denied or suspended, the token is immediately revoked. The next API call fails.
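The shape of such a scheme can be sketched with an HMAC-signed token scoped to one app id, revocable by dropping it from an allow-list. This illustrates the idea only; AitherOS's actual token format, scopes, and storage are not shown in this post:

```python
import hashlib
import hmac
import secrets

SERVER_KEY = secrets.token_bytes(32)               # server-side signing key
_active_tokens: set = set()                        # revocation = removal

def issue_token(app_id: str) -> str:
    """Mint a token bound to one app id and the self_only scope."""
    payload = f"{app_id}:self_only"
    sig = hmac.new(SERVER_KEY, payload.encode(), hashlib.sha256).hexdigest()
    token = f"{payload}:{sig}"
    _active_tokens.add(token)
    return token

def verify_token(token: str, app_id: str) -> bool:
    if token not in _active_tokens:                # revoked on deny/suspend
        return False
    payload, _, sig = token.rpartition(":")
    expected = hmac.new(SERVER_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and payload == f"{app_id}:self_only"

tok = issue_token("travel-agent")
ok = verify_token(tok, "travel-agent")
cross_app = verify_token(tok, "other-app")         # cannot act as another app
_active_tokens.discard(tok)                        # revocation on deny/suspend
revoked = verify_token(tok, "travel-agent")
```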

Network Fence

Community app containers run on an isolated Docker bridge network created with the --internal flag. Firewall rules block access to every internal AitherOS service — the orchestrator, the secrets vault, the identity service, the event logger, the mesh coordinator, threat detection, the analytics pipeline, and the model scheduler are all unreachable.

The only door in is the AppStore gateway, and that door checks the app's cryptographic token on every request. A community app literally cannot establish a TCP connection to any internal service.
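The fence pattern looks roughly like this in Docker terms — an --internal bridge has no route out of the host, and only a dual-homed gateway bridges the two worlds. Network, container, and image names below are illustrative:

```shell
# Internal bridge: containers on it cannot reach external networks or
# services on other Docker networks.
docker network create --internal --driver bridge community-apps

# Community app joins only the fenced network.
docker run -d --network=community-apps --name demo-app demo-image

# The gateway is attached to BOTH the internal net and the default bridge,
# making it the single, token-checking door between the two.
docker network connect community-apps appstore-gateway
```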

Endpoint Guard Coverage

44 of 52 endpoints have explicit caller guards. The 8 unguarded endpoints are intentionally public: health check, catalog browsing, template listing. Every mutation endpoint — install, uninstall, update, approve, deny, suspend, modify, rollback, export, import — requires admin or system identity.

Stage 5: Observability

Every significant event emits a Flux event and writes structured logs:

  • policy.approve, policy.deny, policy.suspend, policy.modify, policy.rollback
  • policy.violation — with severity, rule ID, and action taken
  • app.onboarded — with risk score, verdict, template, finding counts
  • app.launched, app.stopped — with port, isolation mode, policy flag count
  • app.launch_failed — if the container fails its health check, run 3 seconds after launch

Violations are auto-escalated: 3+ critical violations auto-suspend the app. Fatal violations suspend immediately. The violation log is append-only JSONL with flush-on-write.

Audit entries include actor, action, previous/new status, changes dict, reason, correlation ID, and policy version — making it possible to reconstruct the complete history of every policy decision.

Background automation runs on two cycles:

  • Every 15 minutes: check all approved policies for TTL expiry, auto-expire stale ones, stop any running expired apps
  • Every 30 minutes: evaluate the rules engine against all approved apps, auto-stop apps that trigger deny/suspend rules

The Full Flow

Install → Build Docker image → Sandbox probe (30s, capture everything)
  → Analyze behavior → Recommend minimum-privilege policy
  → Auto-approve (low risk) or require admin review
  → Issue scoped app token
  → Enforce at real launch with seccomp + firewall + filesystem ACLs + cgroup limits
  → Continuous monitoring: background rule evaluation, violation tracking, auto-suspension
  → Token revocation on deny/suspend

Every step is audited. Every policy change is versioned. Every enforcement boundary is validated. Every caller is identified.

That's what it takes to let users install arbitrary code from the internet and sleep at night.


Open Source Acknowledgments

The AitherOS community app ecosystem wouldn't exist without the incredible work of open-source contributors. In particular:

  • awesome-llm-apps by Shubham Saboo — The curated collection of 100+ LLM apps with AI Agents, RAG, MCP, voice agents, and multi-agent teams that forms the backbone of our app catalog. With nearly 100k stars and contributions from hundreds of developers, it's one of the most valuable resources in the LLM application space. Licensed under Apache 2.0. If you're building anything with LLMs, go star it.

  • The broader open-source AI community building Streamlit, Gradio, FastAPI, LangChain, CrewAI, and the countless frameworks that make these apps possible.

We built the security system. They built the apps worth securing.
