Bulletproof: How AitherOS Treats Every Community App Like an Untrusted Binary
When we shipped the AitherOS Community App Package Manager, the first question from the team wasn't "does it work?" — it was "what happens when someone installs something malicious?"
Fair question. The AitherOS app ecosystem is designed to pull from open-source collections like awesome-llm-apps by Shubham Saboo — a phenomenal curated collection of 100+ LLM apps with AI Agents, RAG, MCP, voice agents, and multi-agent teams, built with OpenAI, Anthropic, Google, and open-source models. With nearly 100k GitHub stars and hundreds of community-contributed apps spanning everything from AI travel agents to autonomous game-playing bots, it's exactly the kind of rich, diverse ecosystem we want AitherOS users to tap into. We owe a huge thank you to Shubham and every contributor to that repo — it's the foundation our app catalog is built on.
But that diversity is also the challenge. We're giving users the ability to install arbitrary Python code from GitHub, build it into Docker containers, and run it alongside the operating system's many internal services. That's a massive attack surface if you don't treat every installation like loading an unknown binary onto a production server.
So we built a security system that does exactly that. Every community app goes through a pipeline that would make a Linux kernel security module blush: behavioral sandbox probing, automated minimum-privilege policy generation, seccomp syscall filtering, iptables-style firewall rules, AppArmor-style filesystem ACLs, cgroup resource limits, policy versioning with rollback, full audit trails, and a caller isolation layer that puts a hard network and identity boundary between community apps and internal AitherOS services.
This is the full story of how it works.
The Threat Model
Community apps come from GitHub repos. They could be:
- Benign — a Streamlit dashboard that reads an API and shows charts
- Overprivileged — a FastAPI service that opens unnecessary network connections and reads environment variables it doesn't need
- Malicious — something that tries to exfiltrate secrets, escalate privileges, or pivot into the internal service mesh
The goal is simple: apps should get exactly the permissions they need, nothing more, with every deviation recorded and enforced. The system should be able to go from "never seen this app before" to "here's a minimum-privilege policy" without human intervention for low-risk apps, while flagging anything suspicious for manual review.
Stage 1: The Onboarding Pipeline
When you trigger an app installation, the app doesn't just get cloned and pip-installed. It enters a six-stage onboarding pipeline:
SBOM Generation
The first thing we do is build a Software Bill of Materials. We parse requirements.txt line by line and scan Python imports across all source files. Every dependency is cross-referenced against a known-risk database — pyautogui, keyboard, pynput are high-risk (input capture); paramiko, fabric are medium (SSH access); requests, httpx are low (HTTP clients). The SBOM records every package, its version, its risk tier, and its source (requirements file vs. import scan).
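Conceptually, the SBOM step works like the minimal sketch below. The `DEP_RISK` table, the record fields, and the `parse_requirements` name are illustrative assumptions, not the actual AitherOS code:

```python
# Sketch of SBOM dependency risk tagging. The risk table mirrors the
# examples in the text; the real database is larger.
DEP_RISK = {
    "pyautogui": "high", "keyboard": "high", "pynput": "high",   # input capture
    "paramiko": "medium", "fabric": "medium",                    # SSH access
    "requests": "low", "httpx": "low",                           # HTTP clients
}

def parse_requirements(text: str) -> list[dict]:
    """Parse requirements.txt lines into SBOM records with risk tiers."""
    records = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        # Split "name==1.2.3" (or name>=..., or a bare name) into name/version
        for sep in ("==", ">=", "<=", "~=", ">", "<"):
            if sep in line:
                name, version = line.split(sep, 1)
                break
        else:
            name, version = line, "unpinned"
        name = name.strip().lower()
        records.append({
            "package": name,
            "version": version.strip(),
            "risk": DEP_RISK.get(name, "unknown"),
            "source": "requirements",
        })
    return records
```

The same record shape can then be reused for packages discovered by the import scan, with `"source": "import_scan"`.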
Static Risk Scan
Next, we scan every .py file for 35+ risky code patterns using regex:
| Severity | What We Flag |
|---|---|
| Critical | eval(), exec(), os.setuid(), os.chroot(), os.mount(), os.mknod() |
| High | subprocess.*(), pickle.loads(), __import__(), os.system(), ctypes, paramiko, telnetlib |
| Medium | os.environ, socket.socket(), os.chmod(), shutil.rmtree(), smtplib, mmap |
| Low | requests.get(), webbrowser.open(), threading.*, multiprocessing.* |
Each finding records the file, line number, pattern, and severity. The aggregate score is weighted — a single eval() hits harder than ten requests.get() calls.
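A weighted pattern scan of this kind can be sketched in a few lines. The pattern subset and the per-severity weights here are assumptions for illustration; the real scanner covers 35+ patterns:

```python
import re

# Illustrative subset of the risky-pattern table above; weights are assumed.
PATTERNS = [
    ("critical", 1.0, re.compile(r"\b(eval|exec)\s*\(")),
    ("high",     0.6, re.compile(r"\b(os\.system|pickle\.loads|__import__)\s*\(")),
    ("medium",   0.3, re.compile(r"\b(os\.environ|socket\.socket|shutil\.rmtree)\b")),
    ("low",      0.1, re.compile(r"\brequests\.(get|post)\s*\(")),
]

def scan_source(path: str, source: str):
    """Return (findings, weighted_score) for one file. A single critical
    finding outweighs many low-severity ones, as described in the text."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for severity, weight, rx in PATTERNS:
            m = rx.search(line)
            if m:
                findings.append({"file": path, "line": lineno,
                                 "pattern": m.group(0), "severity": severity,
                                 "weight": weight})
    score = sum(f["weight"] for f in findings)
    return findings, score
```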
Sandbox Probe
This is where it gets serious. If the app has a Docker image, we launch it in a maximally locked-down container and watch everything it does for 30 seconds:
```bash
docker run -d \
  --network=none \
  --read-only \
  --tmpfs /tmp:rw,size=64m \
  --cap-drop=ALL \
  --security-opt=no-new-privileges \
  --pids-limit=64 \
  --memory=256m \
  --cpus=0.25 \
  $IMAGE
```
No network. Read-only filesystem. All Linux capabilities dropped. PID and memory capped. Then we collect:
- Process tree via docker top — every PID, PPID, command, and user
- Open file descriptors from /proc/1/fd/ — what files the process has open
- Network connections from /proc/net/tcp, /proc/net/tcp6, /proc/net/udp — parsed from hex with IPv4 and IPv6 support
- Filesystem changes via docker diff — every file added, changed, or deleted
- Resource usage from docker stats — peak memory, CPU percentage, PID count, block I/O
- Container inspection via docker inspect — exit code, OOM-killed status
- Memory maps from /proc/1/maps — which shared libraries and executables are loaded
- Environment variables from /proc/1/environ — which env vars the process can see
- Container logs — stdout and stderr samples
The probe result is a comprehensive data structure with typed records for every observation: syscall traces, file access records, network connections, process records, capability attempts, and resource usage.
Sentinel Verdict
The probe telemetry is submitted to AitherSentinel — the five-stage threat detection pipeline (Bloom filter, feature extraction, anomaly scoring, behavioral baseline, quarantine). Sentinel classifies the app as benign, suspicious, or threat. If Sentinel isn't running, we fall back to local scoring.
Risk Aggregation
The overall risk score is a weighted blend:
| Source | Weight |
|---|---|
| Static analysis | 35% |
| SBOM risk | 15% |
| Probe behavior | 35% |
| CVE findings | 15% |
Apps scoring below 0.3 with a benign Sentinel verdict are auto-approved. Above 0.6 or with a threat verdict: auto-denied. Everything in between: queued for human review.
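The blend and thresholds above reduce to a small decision function. This is a sketch under the stated weights and cutoffs; the function name and score-dict keys are assumptions:

```python
# Weights from the risk aggregation table above.
WEIGHTS = {"static": 0.35, "sbom": 0.15, "probe": 0.35, "cve": 0.15}

def decide(scores: dict[str, float], sentinel_verdict: str) -> str:
    """Blend per-source scores (each 0..1) and apply the thresholds:
    < 0.3 with a benign verdict auto-approves, > 0.6 or a threat
    verdict auto-denies, everything else goes to human review."""
    overall = sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)
    if sentinel_verdict == "threat" or overall > 0.6:
        return "auto-denied"
    if sentinel_verdict == "benign" and overall < 0.3:
        return "auto-approved"
    return "pending_review"
```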
Policy Recommendation
Here's the core of the "model Linux" approach. The recommendation engine analyzes the probe telemetry and generates a minimum-privilege policy — every permission is derived from what the app actually did during the probe, nothing more:
- Filesystem ACLs — only paths the app touched get access, filtered for system noise (/proc, /sys, /usr/lib). Read-only for paths only read, read-write for paths written, execute for loaded binaries.
- Firewall rules — iptables-style. Each observed egress connection becomes an OUTPUT ALLOW rule with a specific destination and port. Each listening port becomes an INPUT ALLOW rule. DNS is allowed only if the app makes network connections. Default: DROP on both chains.
- Seccomp syscall filter — starts with a Python baseline of ~50 syscalls. Network syscalls are added only if the app needs networking, plus any observed non-dangerous syscalls. Never adds ptrace, mount, chroot, reboot, init_module, or kexec_load.
- Linux capabilities — default: none. Only capabilities the app demonstrably needs are added, and never CAP_SYS_ADMIN, CAP_SYS_PTRACE, or CAP_SYS_MODULE.
- Cgroup limits — memory set to 1.5x observed peak (capped at 4096 MB), CPU scaled from observed usage, PID limit at 2x observed peak.
- Environment whitelist — safe defaults (PATH, HOME, LANG, TZ, PYTHONPATH) plus any variables the app actually read, minus anything starting with AITHER_.
Stage 2: The Policy Model
The recommended policy feeds into a comprehensive data model that captures Linux OS-level security controls:
```
App Policy
├── Identity & Provenance (source_repo, source_hash, installed_version)
├── Lifecycle (status, reviewed_by, reviewed_at, expires_at, ttl_days)
├── Network (firewall_rules[], allowed_egress_hosts[], allowed_ports[], dns_allowed)
├── Filesystem (filesystem_rules[], read_only_root, tmpfs_size_mb, volume_mounts)
├── Seccomp (seccomp_syscalls[], seccomp_default_action)
├── Linux Capabilities (linux_capabilities[])
├── Cgroup Limits (max_memory_mb, max_cpus, max_pids, max_io_read/write_mbps, oom_score_adj)
├── Environment (allow_env_vars[], blocked_env_vars[], inject_env{})
├── IPC (ipc_mode, allow_shared_memory)
├── User Namespace (run_as_user, run_as_uid, run_as_gid)
├── Device Access (allowed_devices[])
├── Agent Ecosystem (a2a_enabled, tool_enabled, capabilities[])
├── Trust & Risk (risk_score, risk_flags[], trusted, trust_reason)
├── Counters (violation_count, launch_count, last_launched)
└── Versioning (policy_version, previous_snapshot{})
```
Every field has a purpose. When the app is launched, the policy is translated into Docker run arguments — typically 15-25 flags covering network, memory, CPU, PIDs, capabilities, OOM scoring, IO bandwidth, user namespaces, devices, and volumes. A seccomp profile generator produces a Docker-compatible JSON profile that gets written to disk and applied at container launch.
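The policy-to-flags translation is mechanical. This sketch covers a handful of the fields above; the field names follow the policy model, but the function itself is illustrative:

```python
def policy_to_docker_args(policy: dict) -> list[str]:
    """Translate a few policy fields into docker run flags.
    The real mapping covers 15-25 flags; this is a representative subset."""
    args = []
    if policy.get("network_mode") == "none":
        args.append("--network=none")
    args += [
        f"--memory={policy['max_memory_mb']}m",
        f"--cpus={policy['max_cpus']}",
        f"--pids-limit={policy['max_pids']}",
        "--cap-drop=ALL",                       # default-deny capabilities
    ]
    for cap in policy.get("linux_capabilities", []):
        args.append(f"--cap-add={cap}")         # only demonstrably needed caps
    if policy.get("read_only_root"):
        args.append("--read-only")
    if policy.get("seccomp_profile_path"):
        args.append(f"--security-opt=seccomp={policy['seccomp_profile_path']}")
    return args
```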
Policy Lifecycle
Policies follow a strict state machine:
```
pending_review → approved → suspended → approved (re-approve)
pending_review → denied
approved → expired (TTL exceeded)
denied → pending_review (re-submit)
expired → pending_review (re-review)
```
Every transition is validated — you can't go from denied to expired or from expired to suspended. Invalid transitions are blocked with an error, not silently accepted.
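A state machine like this is most naturally expressed as an allowed-transitions table. The table below follows the lifecycle described above; the function and variable names are illustrative:

```python
# Allowed transitions from the policy lifecycle described above.
ALLOWED = {
    "pending_review": {"approved", "denied"},
    "approved": {"suspended", "expired"},
    "suspended": {"approved"},          # re-approve
    "denied": {"pending_review"},       # re-submit
    "expired": {"pending_review"},      # re-review
}

def transition(current: str, target: str) -> str:
    """Validate a lifecycle transition; invalid ones raise instead of
    being silently accepted."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"invalid transition: {current} -> {target}")
    return target
```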
Versioning and Rollback
Every mutation to a policy:
- Takes a snapshot of the current state
- Applies the change
- Validates the result (checks status, network mode, memory limits, CPU, PID limits, IPC mode, seccomp action)
- Bumps the policy_version counter
- Writes an audit entry with the actor, old status, new status, changes dict, and policy version
If validation fails after a modify, the policy is automatically rolled back to the snapshot. Admins can also manually trigger a rollback via the API. The diff between the current state and the previous snapshot is also available for inspection.
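The snapshot-apply-validate-rollback cycle can be sketched as follows. The function name and the exact field names are assumptions consistent with the policy model above:

```python
import copy

def modify_policy(policy: dict, changes: dict, validate) -> dict:
    """Snapshot, apply, validate, bump version, and roll back automatically
    if validation fails. 'validate' is any callable returning bool."""
    snapshot = copy.deepcopy(policy)
    policy.update(changes)
    if not validate(policy):
        policy.clear()
        policy.update(snapshot)          # automatic rollback to the snapshot
        raise ValueError("policy validation failed; rolled back")
    policy["policy_version"] = policy.get("policy_version", 0) + 1
    policy["previous_snapshot"] = snapshot   # kept for diffing and manual rollback
    return policy
```

Keeping the snapshot on the policy itself is what makes the diff between current state and previous version cheap to compute.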
Atomic Persistence
Policy files are written atomically — temp file, write, os.replace(). A crash mid-save can't corrupt the policy store. Concurrent writes are serialized with a threading lock.
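The atomic-write pattern looks like this minimal sketch, assuming JSON policy files and a module-level lock:

```python
import json
import os
import tempfile
import threading

_lock = threading.Lock()

def save_policy(path: str, policy: dict) -> None:
    """Atomic persistence: write to a temp file in the same directory,
    then os.replace(), so readers never observe a half-written file."""
    with _lock:                                   # serialize concurrent writers
        dirname = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(policy, f)
                f.flush()
                os.fsync(f.fileno())              # durable before the rename
            os.replace(tmp, path)                 # atomic on POSIX filesystems
        except BaseException:
            os.unlink(tmp)                        # don't leak temp files on failure
            raise
```

The temp file must live in the same directory as the target: os.replace() is only atomic within a single filesystem.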
Stage 3: The Rules Engine
Eight built-in rules evaluate automatically, with side effects:
| Rule | Condition | Action |
|---|---|---|
block_critical_static | 3+ critical static findings | Auto-deny |
block_high_risk | Overall risk >= 0.7 | Auto-deny |
flag_medium_risk | Risk 0.4–0.7 | Flag for mandatory review |
suspend_on_probe_crash | App crashed during probe | Auto-deny |
downgrade_network_if_unused | Zero network connections in probe | Force network_access=none |
auto_approve_low_risk | Risk < 0.15, no high/critical findings | Auto-approve |
expire_stale_reviews | Days since review >= TTL | Auto-expire |
block_no_license | No LICENSE file | Flag |
Rules are declarative — condition-action pairs with arbitrary key comparisons (_gte, _lt, _eq, _ne). Custom rules can be added at runtime via the API. The engine runs on a 30-minute background cycle across all approved apps.
Stage 4: Caller Isolation
This is the layer that prevents a community app from "calling home" to AitherOS internal services to approve itself, read secrets, or escalate privileges.
Six Caller Types
| Type | Authentication | Access Level |
|---|---|---|
| SYSTEM | Cryptographically signed service header | FULL — unrestricted |
| AGENT | Agent identity header | AGENT_OPS — launch, stop, query |
| ADMIN | Bearer token + admin role | ADMIN — all CRUD + policy management |
| COMMUNITY_APP | Scoped cryptographic token | SELF_ONLY — own status, own policy |
| EXTERNAL_API | API key | READ_ONLY — catalog browsing |
| ANONYMOUS | None | NONE — health check only |
Scoped App Tokens
When an app is approved, it receives a cryptographically signed token scoped to its app identity. This token lets the app query its own status and policy — and nothing else. It can't read other apps' data. It can't invoke approval, denial, suspension, or any policy management operation. It can't access audit logs, violations across apps, or security previews.
When the app is denied or suspended, the token is immediately revoked. The next API call fails.
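A token scheme with these properties can be built from an HMAC over the app identity plus a revocation set. This is a simplified sketch, not the actual AitherOS token format; the key handling, token layout, and function names are all assumptions:

```python
import hashlib
import hmac
import secrets

SERVER_KEY = secrets.token_bytes(32)   # per-deployment signing key (assumption)
REVOKED: set[str] = set()              # app ids whose tokens were revoked

def issue_token(app_id: str) -> str:
    """Scoped app token: the app id plus an HMAC over it, so the gateway
    can verify both authenticity and which single app it belongs to."""
    sig = hmac.new(SERVER_KEY, app_id.encode(), hashlib.sha256).hexdigest()
    return f"{app_id}.{sig}"

def authorize(token: str, requested_app_id: str) -> bool:
    """SELF_ONLY check: valid signature, not revoked, and the token's
    app id matches the resource being requested."""
    app_id, _, sig = token.partition(".")
    expected = hmac.new(SERVER_KEY, app_id.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False                     # forged or corrupted token
    if app_id in REVOKED:
        return False                     # denied/suspended apps lose access
    return app_id == requested_app_id    # cannot read other apps' data
```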
Network Fence
Community app containers run on an isolated Docker bridge network created with the --internal flag. Firewall rules block access to every internal AitherOS service — the orchestrator, the secrets vault, the identity service, the event logger, the mesh coordinator, threat detection, the analytics pipeline, and the model scheduler are all unreachable.
The only door in is the AppStore gateway, and that door checks the app's cryptographic token on every request. A community app literally cannot establish a TCP connection to any internal service.
Endpoint Guard Coverage
44 of 52 endpoints have explicit caller guards. The 8 unguarded endpoints are intentionally public: health check, catalog browsing, template listing. Every mutation endpoint — install, uninstall, update, approve, deny, suspend, modify, rollback, export, import — requires admin or system identity.
Stage 5: Observability
Every significant event emits a Flux event and writes structured logs:
- policy.approve, policy.deny, policy.suspend, policy.modify, policy.rollback
- policy.violation — with severity, rule ID, and action taken
- app.onboarded — with risk score, verdict, template, finding counts
- app.launched, app.stopped — with port, isolation mode, policy flag count
- app.launch_failed — if the container health check fails 3 seconds after launch
Violations are auto-escalated: 3+ critical violations auto-suspend the app. Fatal violations suspend immediately. The violation log is append-only JSONL with flush-on-write.
Audit entries include actor, action, previous/new status, changes dict, reason, correlation ID, and policy version — making it possible to reconstruct the complete history of every policy decision.
Background automation runs on two cycles:
- Every 15 minutes: check all approved policies for TTL expiry, auto-expire stale ones, stop any running expired apps
- Every 30 minutes: evaluate the rules engine against all approved apps, auto-stop apps that trigger deny/suspend rules
The Full Flow
```
Install → Build Docker image → Sandbox probe (30s, capture everything)
  → Analyze behavior → Recommend minimum-privilege policy
  → Auto-approve (low risk) or require admin review
  → Issue scoped app token
  → Enforce at real launch with seccomp + firewall + filesystem ACLs + cgroup limits
  → Continuous monitoring: background rule evaluation, violation tracking, auto-suspension
  → Token revocation on deny/suspend
```
Every step is audited. Every policy change is versioned. Every enforcement boundary is validated. Every caller is identified.
That's what it takes to let users install arbitrary code from the internet and sleep at night.
Open Source Acknowledgments
The AitherOS community app ecosystem wouldn't exist without the incredible work of open-source contributors. In particular:
- awesome-llm-apps by Shubham Saboo — the curated collection of 100+ LLM apps with AI Agents, RAG, MCP, voice agents, and multi-agent teams that forms the backbone of our app catalog. With nearly 100k stars and contributions from hundreds of developers, it's one of the most valuable resources in the LLM application space. Licensed under Apache 2.0. If you're building anything with LLMs, go star it.
- The broader open-source AI community building Streamlit, Gradio, FastAPI, LangChain, CrewAI, and the countless frameworks that make these apps possible.
We built the security system. They built the apps worth securing.