We Killed the GIL — Here's What Actually Happened
We migrated AitherOS — our entire 50+ service Python stack — to free-threaded CPython 3.14.2. No GIL. For real. Here's the war story — and the benchmarks to prove it.
Why We Did It
AitherOS runs ~50 Python microservices: LLM orchestration, agent dispatch, context assembly, vector search, training pipelines. Every one of them hits the same wall: the Global Interpreter Lock. You can have 64 cores and Python will still serialize your threads through a single lock.
Free-threaded Python (PEP 703) removes the GIL entirely. True parallel threads in Python. We wanted it.
The Migration Path
Step 1: Build the Runtime
CPython 3.14.2 compiled from source with `--disable-gil`. We baked it into a Docker image (`aitheros-python-freethreaded:3.14`) and verified:

```python
>>> import sys
>>> sys._is_gil_enabled()
False
```
GIL is dead. Long live the GIL.
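To keep a misbuilt image from slipping through CI, a small startup guard can refuse to boot on a GIL build. This is a hypothetical helper, not from the AitherOS codebase; it relies only on `sys._is_gil_enabled()`, which exists on Python 3.13+:

```python
import sys

def assert_free_threaded() -> None:
    """Fail fast if this interpreter is still running with a GIL."""
    # sys._is_gil_enabled() only exists on Python 3.13+; on older
    # interpreters the GIL is always present.
    checker = getattr(sys, "_is_gil_enabled", None)
    if checker is None or checker():
        raise RuntimeError("free-threaded CPython required, but the GIL is enabled")
```

Call it once at service startup; on a standard GIL build it raises immediately instead of letting the service limp along serialized.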
Step 2: The Dependency Gauntlet
This is where it gets real. Free-threaded Python uses a different C ABI — the PyObject struct gained fields for thread-safe reference counting. Every C extension needs to be recompiled against the new headers. Many haven't been.
What worked out of the box (cp314t wheels exist):
- PyTorch 2.11.0 ✅
- FastAPI ≥0.136.0 ✅
- Pydantic ✅
- Transformers 5.6.2 ✅
- sentence-transformers 5.4.1 ✅
- ONNX Runtime 1.25.0 ✅
- scikit-image 0.26.0 ✅
- NumPy, SciPy ✅
- Numba 0.65.0, llvmlite 0.47.0 ✅
- safetensors, tokenizers, sentencepiece ✅
- Playwright ✅
What didn't:
- opencv-python — The `cv2.cpp` binding code uses the old `PyObject` layout directly. Compilation fails with ABI errors. Open PR #1051. We guard all cv2 imports with `CV2_AVAILABLE` flags.
- lancedb — `pylance` (Rust core) has no cp314t wheels. Our vector store already had an `InMemoryNexusStore` fallback.
- hf_transfer — PyO3 caps at Python 3.13. `huggingface_hub` works without it.
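The guard pattern for the missing wheels is simple. In this sketch the `CV2_AVAILABLE` flag name is the one from our services; the `decode_image` helper and its behavior are illustrative, not real AitherOS code:

```python
# Guarded import: cv2 has no cp314t wheel yet, so treat it as optional.
try:
    import cv2  # fails on free-threaded builds until the upstream fix lands
    CV2_AVAILABLE = True
except ImportError:
    cv2 = None
    CV2_AVAILABLE = False

def decode_image(raw: bytes):
    """Decode with OpenCV when present; otherwise tell the caller to fall back."""
    if not CV2_AVAILABLE:
        raise RuntimeError("cv2 unavailable on this runtime; use the fallback path")
    import numpy as np  # cv2 always ships alongside numpy
    return cv2.imdecode(np.frombuffer(raw, dtype=np.uint8), cv2.IMREAD_COLOR)
```

Callers branch on `CV2_AVAILABLE` (or catch the `RuntimeError`) instead of crashing at import time on the free-threaded runtime.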
Step 3: The Docker Layer Cake
We run a tiered base image system:
| Image | Size | Contents |
|---|---|---|
| aitheros-base | 5.5 GB | FastAPI, torch, pydantic, httpx, all core deps |
| aitheros-base-ml | 15.5 GB | + transformers, sentence-transformers, torchvision, kornia |
| aitheros-base-browser | 15 GB | + Playwright + Chromium |
Service images inherit from these and only COPY code. A service image builds in 2.3 seconds.
Step 4: The Subtle Bugs
The scariest class of bug in a no-GIL world isn't a crash. It's a fire-and-forget task being garbage collected before it runs: the event loop keeps only a weak reference to tasks, so if you drop the return value of `create_task()`, nothing keeps the task alive.
```python
# BEFORE: coroutine could be GC'd before execution
loop.create_task(ic.raise_from_flux(evt_val, self.service_name, data))

# AFTER: hold a reference, clean up on completion
task = loop.create_task(ic.raise_from_flux(evt_val, self.service_name, data))
self._background_tasks.add(task)
task.add_done_callback(self._background_tasks.discard)
```
This pattern — keeping a set() of background tasks — is now mandatory for fire-and-forget async work in free-threaded Python. The GIL was masking real race conditions.
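Stripped of AitherOS-specific names, the pattern fits in a small self-contained helper. `TaskTracker`, `spawn`, and `drain` are illustrative names, not part of our codebase:

```python
import asyncio

class TaskTracker:
    """Hold strong references to fire-and-forget tasks until they finish."""

    def __init__(self) -> None:
        self._background_tasks: set[asyncio.Task] = set()

    def spawn(self, coro) -> asyncio.Task:
        task = asyncio.get_running_loop().create_task(coro)
        self._background_tasks.add(task)  # strong ref: no premature GC
        task.add_done_callback(self._background_tasks.discard)
        return task

    async def drain(self) -> None:
        """Wait for every still-tracked task (useful in shutdown hooks)."""
        await asyncio.gather(*self._background_tasks)

async def main() -> list[int]:
    tracker = TaskTracker()
    results: list[int] = []

    async def work(n: int) -> None:
        await asyncio.sleep(0)
        results.append(n)

    for n in range(3):
        tracker.spawn(work(n))
    await tracker.drain()
    return results
```

`drain()` gives services a clean shutdown point: instead of killing the loop with tasks in flight, you wait for the tracked set to empty.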
The Benchmarks
Talk is cheap. We built a benchmark suite testing five real AitherOS workload patterns on the same machine — GIL-enabled Python 3.12 on 32 bare-metal cores vs free-threaded Python 3.14t on 24 cores inside Docker. Fewer cores. Still faster.
JSON Processing: 1.02x → 4.79x 🚀
API request handling — serialize, deserialize, and transform 2,000 JSON records per thread.
| | GIL (3.12, 32 cores) | No-GIL (3.14t, 24 cores) |
|---|---|---|
| Sequential | 1,313 ms | 1,006 ms |
| Parallel (threads) | 1,284 ms | 210 ms |
| Speedup | 1.02x | 4.79x |
With the GIL: threads provide literally zero benefit for JSON work. Without it: nearly 5x speedup. Every request Genesis processes, every API response MicroScheduler builds — all of it benefits.
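Our harness isn't public, but the shape of the JSON workload is easy to reproduce in miniature. The record shape and counts below are illustrative, not the exact published benchmark:

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor

def json_worker(n_records: int) -> int:
    """Serialize, deserialize, and transform n_records JSON documents."""
    total = 0
    for i in range(n_records):
        doc = {"id": i, "tags": ["a", "b"], "score": i * 0.5}
        round_tripped = json.loads(json.dumps(doc))
        round_tripped["score"] *= 2  # the "transform" step
        total += round_tripped["id"]
    return total

def run(threads: int, n_records: int = 2_000) -> float:
    """Time the same per-thread workload across a thread pool."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(json_worker, [n_records] * threads))
    return time.perf_counter() - start
```

On a GIL build, `run(8)` takes roughly 8x as long as `run(1)` because `json.dumps`/`json.loads` hold the GIL; on a free-threaded build the threads actually overlap.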
Context Assembly: 1.38x → 2.71x 🚀
What ChatEngine does on every chat request: build context chunks (5,000 dict entries each), merge, sort by relevance.
| | GIL (3.12, 32 cores) | No-GIL (3.14t, 24 cores) |
|---|---|---|
| Sequential | 217 ms | 107 ms |
| Parallel (threads) | 157 ms | 39 ms |
| Speedup | 1.38x | 2.71x |
Context assembly dropped from 157ms to 39ms. That's 118ms off every chat request.
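A minimal sketch of the same pattern, with made-up chunk contents and a toy relevance key; only the build-merge-sort shape matches what ChatEngine does:

```python
from concurrent.futures import ThreadPoolExecutor

def build_chunk(source_id: int, n_entries: int = 5_000) -> list[dict]:
    """One context chunk: n_entries dict records with a relevance score."""
    return [
        {"source": source_id, "text": f"entry-{i}", "relevance": (i * 37) % 100}
        for i in range(n_entries)
    ]

def assemble_context(n_sources: int) -> list[dict]:
    """Build chunks in parallel threads, then merge and sort by relevance."""
    with ThreadPoolExecutor(max_workers=n_sources) as pool:
        chunks = list(pool.map(build_chunk, range(n_sources)))
    merged = [entry for chunk in chunks for entry in chunk]
    merged.sort(key=lambda e: e["relevance"], reverse=True)
    return merged
```

Dict construction and sorting are pure-Python bytecode, exactly the work the GIL serializes, which is why this workload moves from 1.38x to 2.71x.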
CPU Hashing: 0.95x → 1.34x ⚡
Pure CPU workload — SHA-256 rounds across threads. The GIL's worst case.
| | GIL (3.12, 32 cores) | No-GIL (3.14t, 24 cores) |
|---|---|---|
| Sequential | 2,017 ms | 994 ms |
| Parallel (threads) | 2,131 ms | 741 ms |
| Speedup | 0.95x | 1.34x |
With the GIL, 32 threads are slower than sequential. Without it, threads actually help.
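The hashing benchmark reduces to a few lines: chained SHA-256 keeps every thread fully CPU-bound with no shared state. A sketch, not the actual suite:

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

def hash_rounds(rounds: int) -> str:
    """Chain SHA-256 over its own output: pure CPU, no I/O, no shared state."""
    digest = b"seed"
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()

def parallel_hash(threads: int, rounds: int) -> float:
    """Time the same hashing load spread across a thread pool."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        list(pool.map(hash_rounds, [rounds] * threads))
    return time.perf_counter() - start
```

Note that `hashlib` only releases the GIL for large buffers; with tiny 32-byte digests like these, a GIL build serializes every round, which matches the 0.95x result above.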
Async Agent Dispatch: ~1.5x Both ➡️
Mixed CPU + I/O simulating agent tool calls. Similar in both — async I/O was never GIL-limited.
| | GIL (3.12) | No-GIL (3.14t) |
|---|---|---|
| Speedup | 1.67x | 1.48x |
Lock Contention: Bad Everywhere ⚠️
24 threads all fighting over one lock. This benchmark proves a point: if your threads share one lock, no parallelism helps.
| | GIL (3.12) | No-GIL (3.14t) |
|---|---|---|
| Speedup | 0.25x | 0.01x |
Real lock contention is more expensive with true parallelism. This is why thread-safety audits matter.
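Reproducing the pathology is trivial: one shared lock serializes every increment no matter how many cores you have. An illustrative sketch, not the benchmark suite itself:

```python
import threading

def contended_counter(threads: int, increments: int) -> int:
    """All threads bump one counter behind one lock: worst-case sharing."""
    lock = threading.Lock()
    count = 0

    def worker() -> None:
        nonlocal count
        for _ in range(increments):
            with lock:  # every increment serializes here
                count += 1

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return count
```

With the GIL, threads hand off the lock fairly cheaply because only one runs anyway; with true parallelism, 24 cores hammering one lock pay real cache-line and contention costs, hence 0.01x.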
Summary Table
| Workload | GIL Speedup | No-GIL Speedup | Delta |
|---|---|---|---|
| JSON Processing | 1.02x | 4.79x | 🚀 +370% |
| Context Assembly | 1.38x | 2.71x | 🚀 +96% |
| CPU Hashing | 0.95x | 1.34x | ⚡ +41% |
| Async Dispatch | 1.67x | 1.48x | ➡️ Similar |
| Lock Contention | 0.25x | 0.01x | ⚠️ Audit your locks |
The workloads where AitherOS spends most of its time — JSON processing, context assembly, CPU-bound computation — all show significant gains.
Should You Do This?
If you're running a Python-heavy microservice architecture and you're willing to:
- Build CPython from source
- Guard imports for ~3 packages without cp314t wheels yet
- Audit your fire-and-forget async patterns
- Accept that the ecosystem is 95% there, not 100%
Then yes. The GIL is the single biggest performance bottleneck in Python server applications, and removing it is worth the migration pain.
The ecosystem will catch up. We'd rather be ready when it does.