Free-Threaded Python (3.14t)
Python 3.14t is the free-threaded build of CPython — the GIL is disabled (Py_GIL_DISABLED=1).
This page reports what changes for aerospike-py and the official aerospike C client when the runtime is swapped, with no Rust or C source changes.
Sources: benchmark/results/python-3.14t-benchmark.md, benchmark/results/k6-runtime-client-comparison.md.
TL;DR
p95 single-mode (k6 10 VUs × 60s, FastAPI + DLRM)
aerospike-py
3.11 + GIL ███████████████ 189 ms
3.14t ████████ 97 ms −49% 🔥
official (C extension)
3.11 + GIL █████████████████████████ 324 ms
3.14t ██████████ 128 ms −60% 🔥
Throughput (aerospike-py): 41.6 → 61.2 iter/s +47%
The GIL was a shared bottleneck for both clients. aerospike-py's gain came without touching Rust.
What gets faster (and why)
aerospike-py: stage breakdown
Internal stage timings under load. The two GIL-bound stages collapse to near-zero.
Stage 3.11 + GIL 3.14t Change
────────────────────────────────────────────────────────────────────
spawn_blocking_delay 234 ms ████████ 0.12 ms −99.95% 🔥
event_loop_resume_delay 39.7 ms ██ ≈ 0 ≈ −100%
io (Aerospike network) 7.51 ms ▌ 1.27 ms −83%
merge_as_dict 4.48 ms ▎ 3.54 ms −21%
key_parse 967 μs · 1.06 ms +10% (noise)
tokio_schedule_delay 83.1 μs 49.5 μs −40%
limiter_wait 3.56 μs 0.96 μs −73%
Two stages dominate the gain:
- spawn_blocking_delay drops from 234 ms to 0.12 ms. Under the GIL, when a Rust async future completes and needs to convert results into Py<...> objects, it dispatches that work to a spawn_blocking worker. That worker has to acquire the GIL, and under contention from the asyncio event loop and other workers the queue stretches into hundreds of milliseconds. With no GIL, the worker runs immediately.
- event_loop_resume_delay drops to effectively zero. Under the GIL, after the future resolves and the event loop is woken up, the coroutine still has to wait its turn for the GIL before resuming. In free-threaded mode the event-loop thread no longer competes with worker threads for the GIL, so the coroutine resumes as soon as the loop schedules it.
The roughly 6× drop in io (7.51 ms → 1.27 ms) is a second-order effect: Tokio workers can now parse Aerospike protocol responses without contending with Python code for the GIL.
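The Change column can be recomputed from the two timing columns. The sketch below copies the stage values from the table (all converted to milliseconds); the helper names `pct_change` and `speedup` are illustrative and not part of either client.

```python
# Stage timings in milliseconds: (3.11 + GIL, 3.14t), copied from the table.
STAGES = {
    "spawn_blocking_delay": (234.0, 0.12),
    "io": (7.51, 1.27),
    "merge_as_dict": (4.48, 3.54),
    "key_parse": (0.967, 1.06),
    "tokio_schedule_delay": (0.0831, 0.0495),
    "limiter_wait": (0.00356, 0.00096),
}


def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def speedup(before: float, after: float) -> float:
    """How many times smaller `after` is than `before`."""
    return before / after


for name, (gil, free_threaded) in STAGES.items():
    print(
        f"{name:24s} {pct_change(gil, free_threaded):+8.2f}%  "
        f"{speedup(gil, free_threaded):8.1f}x"
    )
```

Reading the two columns side by side helps keep percentage drops and "N× faster" statements consistent: a drop of 83% corresponds to a factor of just under 6.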
official C client: even bigger ratio
The official aerospike client is a synchronous C extension; applications wrap it with loop.run_in_executor(ThreadPoolExecutor, ...). Each request:
1. Thread-pool worker dispatch
2. C extension acquires GIL to parse arguments ← serialized under GIL
3. Network call (releases GIL)
4. C extension reacquires GIL to build Python dict ← serialized under GIL
Under GIL contention, steps 2 and 4 serialize across all in-flight requests.
Removing the GIL parallelizes them. The official client's p95 falls 60% (324 → 128 ms), a larger drop than aerospike-py's 49% in both relative and absolute terms (196 ms vs 92 ms), because it had more to lose.
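The four steps above correspond to the usual pattern for calling a synchronous client from asyncio. Here is a minimal sketch, assuming a stand-in `blocking_get` function (hypothetical: it sleeps instead of calling the real `aerospike` client), so the example runs without a server.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_get(key: str) -> dict:
    """Hypothetical stand-in for a synchronous client call."""
    time.sleep(0.01)  # simulated network round trip; sleep releases the GIL
    return {"key": key, "bins": {"value": len(key)}}  # result built as a dict


async def fetch_all(keys: list[str], pool: ThreadPoolExecutor) -> list[dict]:
    loop = asyncio.get_running_loop()
    # Each call hops to a pool thread; the coroutine resumes when it finishes.
    futures = [loop.run_in_executor(pool, blocking_get, k) for k in keys]
    return await asyncio.gather(*futures)


def main() -> list[dict]:
    with ThreadPoolExecutor(max_workers=4) as pool:
        return asyncio.run(fetch_all(["a", "bb", "ccc"], pool))


if __name__ == "__main__":
    print(main())
```

In this pattern the argument handling and the dict construction run on pool threads that need the GIL on a GIL-enabled build, which is where the serialization described above comes from.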
Same-workload comparison on 3.14t
On 3.14t, when both clients hit the same server load (alternating endpoints in the same pod, k6 10 VUs):
| Client | p95 (single mode) | Difference |
|---|---|---|
| aerospike-py | 126 ms | baseline |
| official aerospike | 128 ms | +2 ms (~1.5%, noise) |
The 42% gap that existed under 3.11 + GIL collapses to ~2 ms under 3.14t.
This is the cleanest evidence that GIL contention — not architectural difference — was responsible for the bulk of the original gap.
Throughput (TPS)
Latency improvements translate directly into throughput. Two TPS views — k6 client iterations/s and server-side predict_requests_total rate — agree.
k6 iterations/s (full 5m 30s run)
3.11 + GIL, stage OFF ████████████ 41.6 baseline
3.11 + GIL, stage ON █████████████ 44.1 +6% (noise)
3.14t, aerospike-py only ███████████████████ 61.2 +47% 🔥
3.14t, both clients (split load) ██████████████ 47.3 +14%
| Config | iterations/s | http_reqs/s |
|---|---|---|
| 3.11 + GIL, stage OFF | 41.6 | 50.8 |
| 3.11 + GIL, stage ON | 44.1 | 52.9 |
| 3.14t, aerospike-py only | 61.2 | 80.0 |
| 3.14t, both clients (split load) | 47.3 | 59.8 |
Server-side req/s
| Config | aerospike-py | official aerospike |
|---|---|---|
| 3.11 + GIL | 40.9 req/s | ~24 req/s¹ |
| 3.14t, aerospike-py only | 42.5 req/s | n/a (503)² |
| 3.14t, both clients | 32.9 req/s | 34.4 req/s |
¹ 3.11 + GIL warmup window inflates the official client's first sample; steady-state rate is comparable.
² The :314t image ships only aerospike-py — official endpoints return 503 (no cp314t PyPI wheel).
Why "both clients" drops to 47.3 iter/s
When the two endpoints are loaded simultaneously, the same Aerospike server and FastAPI pod CPU are split between clients. Per-client throughput naturally halves. The 61.2 iter/s peak is the actual aerospike-py ceiling when it runs solo.
But aerospike-py still wins under real load
The "126 vs 128 ms" tie comes from a configuration where each client only sees ~5 effective VUs (10 VUs split across two endpoints). When the load is concentrated on one client, the gap reopens.
Solo-load comparison on 3.14t (each client owns the server)
| Metric | aerospike-py solo | official solo | aerospike-py advantage |
|---|---|---|---|
| k6 single p95 | 97 ms | 134 ms | −28% |
| k6 gather p95 (9× fan-out) | 107 ms | 253 ms | −58% 🔥 |
| Server predict_duration_seconds p95 | 100 ms | 138 ms | −28% |
| Server batch_read_all p95 | 64 ms | 67 ms | −4% |
| Theoretical capacity (Little's Law: 10 VUs / p95) | ~103 req/s | ~75 req/s | +37% |
The two solo runs used different k6 scripts. k6_benchmark_official_only.js sends exactly 1 request per iteration; k6_benchmark.js (used for aerospike-py solo) splits 10 VUs across 4 scenarios. Raw iterations/s is not apples-to-apples — but per-request latency is, because both runs used the same VU count and the same server.
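The capacity row rearranges Little's Law (concurrency = throughput × latency), with the p95 latency standing in for the time each request spends in the system. Here is a short worked version using the k6 single p95 values from the table; the function name is illustrative. Because p95 is normally higher than mean latency, the figure should be read as a conservative estimate rather than a measured rate.

```python
def capacity_req_per_s(concurrency: int, latency_ms: float) -> float:
    """Little's Law rearranged: throughput = concurrency / latency."""
    return concurrency / (latency_ms / 1000.0)


aerospike_py = capacity_req_per_s(10, 97)   # k6 single p95, aerospike-py solo
official = capacity_req_per_s(10, 134)      # k6 single p95, official solo

print(f"aerospike-py: {aerospike_py:.1f} req/s")
print(f"official:     {official:.1f} req/s")
print(f"advantage:    {(aerospike_py / official - 1) * 100:+.1f}%")
```

With unrounded inputs the advantage comes out slightly above 38%. The table's +37% follows from dividing the rounded capacities (103 / 75).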
Why aerospike-py still leads under solo load on 3.14t:
- Native async vs threadpool wrapping. Even without GIL contention, run_in_executor adds a thread-pool hop per request. aerospike-py awaits directly on the asyncio loop.
- Lazy dict conversion. batch_read() returns a BatchReadHandle (Arc-wrapped, ~10 μs). The Python-dict materialization happens lazily when the caller invokes .as_dict(). The official client builds the full dict on I/O completion.
- Single FFI boundary crossing. The aerospike-py Rust code completes a full batch_read inside one PyO3 call. The C extension crosses the Python ↔ C boundary multiple times per call.
These advantages compound as concurrency rises (the gather number — 107 vs 253 ms — is the clearest example).
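The lazy-conversion point can be illustrated in pure Python. This is a sketch of the idea only, not aerospike-py's implementation: `LazyBatchHandle` and its fields are hypothetical names, and the real handle lives in Rust.

```python
from typing import Any, Optional


class LazyBatchHandle:
    """Hypothetical handle that defers building the result dict."""

    def __init__(self, raw_records: list[tuple[str, dict[str, Any]]]) -> None:
        self._raw = raw_records  # cheap: only keeps a reference
        self._cache: Optional[dict[str, dict[str, Any]]] = None
        self.materializations = 0  # counts how often the dict was built

    def as_dict(self) -> dict[str, dict[str, Any]]:
        # Build the dict on first use only; later calls reuse the cache.
        if self._cache is None:
            self._cache = {key: dict(bins) for key, bins in self._raw}
            self.materializations += 1
        return self._cache


handle = LazyBatchHandle([("user:1", {"age": 30}), ("user:2", {"age": 41})])
# Nothing has been converted yet; the cost is paid when the caller asks.
records = handle.as_dict()
print(records["user:2"]["age"])
```

The design choice is that a caller who never needs the dict, or who only needs it after further awaits, does not pay for the conversion at I/O completion.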
Recommended migration path
- Add a 3.14t row to the CI matrix. Run unit + integration tests under python:3.14.2t-slim.
- Audit unsafe and shared mutable state in Rust. aerospike-py is mostly thread-safe but should be audited before declaring gil_used = false.
- Promote to #[pymodule(gil_used = false)]. Once the audit is clean.
- Wait for or build official client wheels. PyPI does not yet ship cp314t wheels for the official aerospike package. Source build works (see notes below).
Notes
Side-effect: inference also got faster. DLRM inference (PyTorch CPU, control variable) dropped from 43.5 ms to 20.7 ms (−52%) on 3.14t. This is unrelated to aerospike-py: it is a free side benefit for any GIL-bound inference path running alongside async I/O.
GIL state verified. With Py_GIL_DISABLED=1, the interpreter did not re-enable the GIL after import aerospike_py. The Rust module currently declares #[pymodule(gil_used = true)], but the underlying code is already mostly thread-safe (ArcSwapOption for client, Arc<Vec<BatchRecord>> for batch handles, Mutex for metric registry). Promoting to gil_used = false after a full audit is a follow-up.
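One way to run this kind of check from Python is sketched below. The source does not say which method was used for the verification, so treat this as an assumption. It relies on `sysconfig.get_config_var("Py_GIL_DISABLED")` and `sys._is_gil_enabled()`, the latter available on CPython 3.13 and later; the helper names are illustrative.

```python
import importlib
import sys
import sysconfig


def is_free_threaded_build() -> bool:
    """True when the interpreter was compiled with Py_GIL_DISABLED=1."""
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def gil_enabled() -> bool:
    """True when the GIL is currently active in this process."""
    # sys._is_gil_enabled() exists on CPython 3.13+; older versions always
    # run with the GIL, so fall back to True when it is missing.
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else bool(check())


def gil_enabled_after_import(module_name: str) -> bool:
    """Import a module, then report whether the GIL is active."""
    importlib.import_module(module_name)
    return gil_enabled()


if __name__ == "__main__":
    print("free-threaded build:", is_free_threaded_build())
    # "json" is a placeholder; substitute the extension module under test.
    print("GIL enabled after import:", gil_enabled_after_import("json"))
```

Running the same check in CI on the 3.14t image would catch a dependency that silently switches the GIL back on.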
Build images. aerospike-benchmark:314t ships only aerospike-py (uses cp314t wheels from benchmark/deploy/wheels-314t/). aerospike-benchmark:314t-with-official adds the official C client built from source — required apt deps: build-essential libssl-dev libuv1-dev liblua5.1-0-dev libyaml-dev pkg-config zlib1g-dev (libyaml-dev is easy to miss). Build time ~10 min.