Free-Threaded Python (3.14t)
Python 3.14t is the free-threaded build of CPython — the GIL is disabled (Py_GIL_DISABLED=1).
This page reports what changes for aerospike-py and the official aerospike C client when the runtime is swapped, with no Rust or C source changes.
Source:
benchmark/results/python-3.14t-benchmark.md and benchmark/results/k6-runtime-client-comparison.md.
TL;DR
| Client | p95 (3.11 + GIL) | p95 (3.14t) | Improvement |
|---|---|---|---|
| aerospike-py | 189 ms | 97 ms | −49% 🔥 |
| official aerospike (C client, source-built) | 324 ms | 128 ms | −60% 🔥 |
Throughput (aerospike-py): 41.6 → 61.2 iter/s (+47%) at 10 VUs single mode.
The GIL was a shared bottleneck for both clients. aerospike-py's gain came without touching Rust.
Why GIL Removal Helps Aerospike-py
The biggest internal stage costs in aerospike-py under 3.11 + GIL are not network I/O — they're stages where Rust async work waits for the Python interpreter:
| Stage | 3.11 + GIL avg | 3.14t avg | Change |
|---|---|---|---|
| spawn_blocking_delay | 234 ms | 0.12 ms | −99.95% 🔥 |
| event_loop_resume_delay | 39.7 ms | ≈ 0 | ≈ −100% |
| io (Aerospike network) | 7.51 ms | 1.27 ms | −83% |
| merge_as_dict | 4.48 ms | 3.54 ms | −21% |
| key_parse | 967 μs | 1.06 ms | +10% (noise) |
| tokio_schedule_delay | 83.1 μs | 49.5 μs | −40% |
| limiter_wait | 3.56 μs | 0.96 μs | −73% |
Two stages dominate the gain:
- spawn_blocking_delay drops from 234 ms to 0.12 ms. Under the GIL, when a Rust async future completes and needs to convert results into Py<...> objects, it dispatches that work to a spawn_blocking worker. That worker has to acquire the GIL, and under contention from the asyncio event loop and other workers the queue stretches into hundreds of milliseconds. With no GIL, the worker runs immediately.
- event_loop_resume_delay drops to effectively zero. Under the GIL, after the future resolves and the event loop is woken up, the coroutine still has to wait its turn for the GIL before resuming. Free-threaded mode lets multiple coroutines resume in parallel.
io shrinking roughly 6× (7.51 ms → 1.27 ms) is a second-order effect: Tokio workers can now parse Aerospike protocol responses without contending with Python code for the GIL.
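The queueing effect described above can be approximated in pure Python. The sketch below is not aerospike-py's instrumentation. It is a hypothetical analogue that measures how long a thread-pool task waits between submission and first execution while CPU-bound threads compete for the interpreter. The names busy_loop, start_delay, and measure_start_delays are invented for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def busy_loop(stop_event):
    # CPU-bound pure-Python loop; on a GIL build it holds the GIL while it runs.
    n = 0
    while not stop_event.is_set():
        n += 1
    return n


def start_delay(submitted_at):
    # Time between submission and the worker actually executing Python code.
    return time.perf_counter() - submitted_at


def measure_start_delays(samples=20, hogs=2):
    """Return per-task start delays (seconds) while `hogs` threads spin."""
    stop = threading.Event()
    hog_threads = [
        threading.Thread(target=busy_loop, args=(stop,)) for _ in range(hogs)
    ]
    for t in hog_threads:
        t.start()
    try:
        with ThreadPoolExecutor(max_workers=4) as pool:
            futures = [
                pool.submit(start_delay, time.perf_counter())
                for _ in range(samples)
            ]
            delays = [f.result() for f in futures]
    finally:
        # Always stop the spinning threads, even if a task raised.
        stop.set()
        for t in hog_threads:
            t.join()
    return delays


if __name__ == "__main__":
    delays = measure_start_delays()
    print(f"max start delay: {max(delays) * 1000:.2f} ms")
```

On a GIL build the delays tend to grow with the number of competing threads. On a free-threaded build with spare cores they should stay small, though the exact numbers depend on the machine.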
Why GIL Removal Helps the Official C Client (Even More, in Ratio)
The official aerospike client is a synchronous C extension; applications wrap it with loop.run_in_executor(ThreadPoolExecutor, ...). Each request:
1. Hops onto a thread-pool worker
2. Acquires the GIL in the C extension to parse arguments
3. Runs the network call (releasing the GIL)
4. Reacquires the GIL to build the Python result
Under GIL contention, steps 2 and 4 serialize across all in-flight requests. Removing the GIL parallelizes them, which is why the official client's p95 falls 60% (324 → 128 ms), a larger relative drop than aerospike-py's 49% (189 → 97 ms).
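The wrapping pattern described above looks roughly like the sketch below. It assumes a stand-in function, blocking_get, in place of a real synchronous client call, so it runs without an Aerospike server. The function names, the sleep, and the returned dict shape are all hypothetical.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_get(key):
    # Hypothetical stand-in for a synchronous C-extension call.
    # time.sleep releases the GIL, much like a blocking network read would.
    time.sleep(0.01)
    # Building the Python result is the part that needs the GIL on a GIL build.
    return {"key": key, "bins": {"v": len(key)}}


async def fetch_many(keys, max_workers=8):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One thread-pool hop per request.
        futures = [loop.run_in_executor(pool, blocking_get, k) for k in keys]
        return await asyncio.gather(*futures)


if __name__ == "__main__":
    print(asyncio.run(fetch_many(["a", "bb", "ccc"])))
```

Each request pays for the hop onto a worker thread plus two interpreter-bound phases, which is where contention shows up when many requests are in flight.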
Same-Workload Comparison on 3.14t
On 3.14t, when both clients hit the same server load (alternating endpoints in the same pod, k6 10 VUs):
| Client | p95 (single mode) | Difference |
|---|---|---|
| aerospike-py | 126 ms | baseline |
| official aerospike | 128 ms | +2 ms (~1.5%, noise) |
The 42% gap that existed under 3.11 + GIL (189 ms vs 324 ms) collapses to ~2 ms under 3.14t.
This is the cleanest evidence that GIL contention — not architectural difference — was responsible for the bulk of the original gap.
Throughput (TPS)
Latency improvements translate directly into throughput. Two TPS views — k6 client iterations/s and server-side predict_requests_total rate — agree.
k6 iterations/s (full 5m 30s run)
| Config | iterations/s | http_reqs/s | vs 3.11 baseline |
|---|---|---|---|
| 3.11 + GIL, stage OFF | 41.6 | 50.8 | baseline |
| 3.11 + GIL, stage ON | 44.1 | 52.9 | +6% (noise) |
| 3.14t, aerospike-py only | 61.2 | 80.0 | +47% 🔥 |
| 3.14t, both clients (split load) | 47.3 | 59.8 | +14% |
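The "vs 3.11 baseline" column can be reproduced from the iterations/s figures with a simple relative-change calculation. The helper name pct_change is illustrative, and the numbers are copied from the table above.

```python
def pct_change(baseline, value):
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


baseline_iter_s = 41.6  # 3.11 + GIL, stage OFF

rows = {
    "3.11 + GIL, stage ON": 44.1,
    "3.14t, aerospike-py only": 61.2,
    "3.14t, both clients (split load)": 47.3,
}

for name, iter_s in rows.items():
    print(f"{name}: {pct_change(baseline_iter_s, iter_s):+.0f}%")
```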
Server-side predict_requests_total rate (single mode)
| Config | aerospike-py | official aerospike |
|---|---|---|
| 3.11 + GIL | 40.9 req/s | ~24 req/s¹ |
| 3.14t, aerospike-py only | 42.5 req/s | n/a (503)² |
| 3.14t, both clients | 32.9 req/s | 34.4 req/s |
¹ 3.11 + GIL warmup window inflates the official client's first sample; steady-state rate is comparable to aerospike-py.
² The :314t image ships only aerospike-py — official endpoints return 503 (no cp314t PyPI wheel).
Why "both clients" drops to 47.3 iter/s
When the two endpoints are loaded simultaneously, the same Aerospike server and FastAPI pod CPU are split between clients. Per-client throughput naturally halves, so combined iterations/s reflects shared server capacity rather than per-client ceiling. The 61.2 iter/s peak is the actual aerospike-py ceiling when it runs solo.
But Aerospike-py Still Wins Under Real Load
The "126 vs 128 ms" tie comes from a configuration where each client only sees ~5 effective VUs (10 VUs split across two endpoints). When the load is concentrated on one client, the gap reopens.
Solo-load comparison on 3.14t (each client owns the server)
| Metric | aerospike-py solo | official solo | aerospike-py advantage |
|---|---|---|---|
| k6 single p95 | 97 ms | 134 ms | −28% |
| k6 gather p95 (9× fan-out) | 107 ms | 253 ms | −58% 🔥 |
| Server predict_duration_seconds p95 | 100 ms | 138 ms | −28% |
| Server batch_read_all p95 | 64 ms | 67 ms | −4% |
| Theoretical capacity (Little's Law: 10 VUs / p95) | ~103 req/s | ~75 req/s | +37% |
The two solo runs used different k6 scripts. k6_benchmark_official_only.js sends exactly 1 request per iteration; k6_benchmark.js (used for aerospike-py solo) splits 10 VUs across 4 scenarios. Raw iterations/s is not apples-to-apples — but per-request latency is, because both runs used the same VU count and the same server.
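The "theoretical capacity" row follows from Little's Law (concurrency = throughput × time in system), rearranged to throughput = concurrency / latency. The sketch below plugs in the k6 single p95 values from the table. Little's Law is normally stated for mean latency, so using p95 gives a conservative estimate. The function name is illustrative.

```python
def capacity_req_per_s(concurrency, latency_ms):
    """Little's Law rearranged: throughput = concurrency / latency."""
    return concurrency / (latency_ms / 1000.0)


vus = 10
aerospike_py = capacity_req_per_s(vus, 97)   # k6 single p95, aerospike-py solo
official = capacity_req_per_s(vus, 134)      # k6 single p95, official solo

print(f"aerospike-py: ~{aerospike_py:.0f} req/s")
print(f"official:     ~{official:.0f} req/s")
print(f"ratio:        {aerospike_py / official:.2f}x")
```

The ratio of the rounded capacities (103 / 75) gives the table's +37%. With unrounded values it comes out closer to +38%.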
Why aerospike-py still leads under solo load on 3.14t:
- Native async vs threadpool wrapping. Even without GIL contention, run_in_executor adds a thread-pool hop per request. aerospike-py awaits directly on the asyncio loop.
- Lazy dict conversion. batch_read() returns a BatchReadHandle (Arc-wrapped, ~10 μs). The Python-dict materialization happens lazily when the caller invokes .as_dict(), avoiding eager work. The official client builds the full dict on I/O completion.
- Single FFI boundary crossing. The aerospike-py Rust code completes a full batch_read inside one PyO3 call. The C extension crosses the Python ↔ C boundary multiple times per call.
These advantages compound as concurrency rises (the gather number — 107 vs 253 ms — is the clearest example).
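The lazy-conversion idea can be illustrated with a small pure-Python analogue. This is not the BatchReadHandle implementation: LazyHandle, its caching behaviour, and the record shape are assumptions made for the sketch. The point is only that creating the handle is cheap and the dict is built when as_dict() is first called.

```python
class LazyHandle:
    """Hypothetical handle that defers dict construction until requested."""

    def __init__(self, raw_records):
        # Keep a reference to the raw (key, bins) pairs; no conversion yet.
        self._raw = raw_records
        self._cache = None
        self.materialize_count = 0

    def as_dict(self):
        # Build the dict on first use only; later calls reuse the cached result.
        if self._cache is None:
            self.materialize_count += 1
            self._cache = {key: dict(bins) for key, bins in self._raw}
        return self._cache


raw = [("user:1", [("age", 30)]), ("user:2", [("age", 41)])]
handle = LazyHandle(raw)      # cheap: nothing converted yet
records = handle.as_dict()    # conversion happens here
```

A caller that only forwards the handle, or drops it, never pays for the conversion.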
Recommended Migration Path
- Add a 3.14t row to the CI matrix. Run unit + integration tests under python:3.14.2t-slim.
- Audit unsafe and shared mutable state in Rust. aerospike-py is mostly thread-safe but should be audited before declaring gil_used = false.
- Promote to #[pymodule(gil_used = false)]. Do this once the audit is clean.
- Wait for or build official client wheels. PyPI does not yet ship cp314t wheels for the official aerospike package. Source build works (see notes below).
Notes
Side-effect: inference also got faster. DLRM inference (PyTorch CPU, control variable) dropped from 43.5 ms → 20.7 ms (−52%) on 3.14t. Unrelated to aerospike-py — a free side benefit for any GIL-bound inference path running alongside async I/O.
GIL state verified. With Py_GIL_DISABLED=1, the interpreter did not re-enable GIL after import aerospike_py — the Rust module currently declares #[pymodule(gil_used = true)] but the underlying code is already mostly thread-safe (ArcSwapOption for client, Arc<Vec<BatchRecord>> for batch handles, Mutex for metric registry). Promoting to gil_used = false after a full audit is a follow-up.
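A check along these lines can be run after importing extension modules to confirm the GIL state. It is a minimal sketch using sysconfig's Py_GIL_DISABLED build variable and sys._is_gil_enabled() (available on CPython 3.13+). The function name gil_status and the returned dict shape are illustrative, and the import of the module under test is left to the caller.

```python
import sys
import sysconfig


def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is active."""
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() exists on CPython 3.13+; older versions always have a GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = bool(check()) if check is not None else True
    return {
        "free_threaded_build": free_threaded_build,
        "gil_enabled": gil_enabled,
    }


if __name__ == "__main__":
    # Import the extension module under test before this point, then inspect the state.
    print(gil_status())
```

On a free-threaded build, gil_enabled flipping to True after an import indicates that some loaded extension caused the interpreter to re-enable the GIL.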
Build images. aerospike-benchmark:314t ships only aerospike-py (uses cp314t wheels from benchmark/deploy/wheels-314t/). aerospike-benchmark:314t-with-official adds the official C client built from source — required apt deps: build-essential libssl-dev libuv1-dev liblua5.1-0-dev libyaml-dev pkg-config zlib1g-dev (libyaml-dev is easy to miss). Build time ~10 min.